[jira] [Closed] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-3149. --- > Tikka 1.18 not working with tess4j 3.4.8 on linux >

[jira] [Updated] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3149: Description: I am using Tika version 1.18 to parse the document content. It is working ind

[jira] [Resolved] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-3149. - Assignee: Konstantin Gribov Resolution: Not A Bug > Tikka 1.18 not working with tess

[jira] [Commented] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331122#comment-17331122 ] Konstantin Gribov commented on TIKA-3149: - You have both slf4j-jdk14 (logger imple

[jira] [Updated] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test

2021-04-23 Thread Konstantin Gribov (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3369: Description: Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with {noformat} [E

[jira] [Created] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test

2021-04-23 Thread Konstantin Gribov (Jira)
Konstantin Gribov created TIKA-3369: --- Summary: Flaky Tesseract OCR confirmMultiPageTiffHandling test Key: TIKA-3369 URL: https://issues.apache.org/jira/browse/TIKA-3369 Project: Tika Issue

[RFC] Tika BOMs/platforms

2021-04-23 Thread Konstantin Gribov
Hi, folks. I hope for comments and a kind of lazy consensus. If there are no objections, I'll merge it to main and branch_1x. I created tika-bom modules with a bill of materials (in Apache Maven terminology) / platform (for Gradle users). It will allow easy alignment of Tika module versions and to w

[jira] [Commented] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-04-23 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331107#comment-17331107 ] ASF GitHub Bot commented on TIKA-3368: -- grossws opened a new pull request #432: URL:

[GitHub] [tika] grossws opened a new pull request #432: [TIKA-3368] Add tika-bom module

2021-04-23 Thread GitBox
grossws opened a new pull request #432: URL: https://github.com/apache/tika/pull/432 Fixes #TIKA-3368 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about thi

[jira] [Commented] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331102#comment-17331102 ] ASF GitHub Bot commented on TIKA-3367: -- grossws opened a new pull request #431: URL:

[GitHub] [tika] grossws opened a new pull request #431: [TIKA-3367] Add Bill of Materials (BOM)

2021-04-23 Thread GitBox
grossws opened a new pull request #431: URL: https://github.com/apache/tika/pull/431 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, pleas

[jira] [Created] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-04-23 Thread Konstantin Gribov (Jira)
Konstantin Gribov created TIKA-3368: --- Summary: Add Bill of Materials (BOM) artifact (Tika 1.x) Key: TIKA-3368 URL: https://issues.apache.org/jira/browse/TIKA-3368 Project: Tika Issue Type:

[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread Konstantin Gribov (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3367: Fix Version/s: (was: 1.27) > Add Bill of Materials (BOM) artifact >

[jira] [Created] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread Konstantin Gribov (Jira)
Konstantin Gribov created TIKA-3367: --- Summary: Add Bill of Materials (BOM) artifact Key: TIKA-3367 URL: https://issues.apache.org/jira/browse/TIKA-3367 Project: Tika Issue Type: Improvement

[jira] [Resolved] (TIKA-3363) Have tika-docker artifacts start in spawn mode (configurable)

2021-04-23 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved TIKA-3363. Fix Version/s: (was: 1.27) Resolution: Won't Fix > Have tika-docker artif

[jira] [Closed] (TIKA-3363) Have tika-docker artifacts start in spawn mode (configurable)

2021-04-23 Thread Lewis John McGibbney (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed TIKA-3363. -- > Have tika-docker artifacts start in spawn mode (configurable) > --

[jira] [Created] (TIKA-3366) Retrospective release of tika-docker 2.0.0-ALPHA

2021-04-23 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created TIKA-3366: -- Summary: Retrospective release of tika-docker 2.0.0-ALPHA Key: TIKA-3366 URL: https://issues.apache.org/jira/browse/TIKA-3366 Project: Tika Issue

[INVITATION] Apache Tika container orchestration meetup

2021-04-23 Thread lewis john mcgibbney
Hi Folks, If you are interested in participating in a mini meetup based around Apache Tika container orchestration then please indicate your preferred availability at the Doodle Poll below. This community meetup focuses on Tika container orchestration (Docker, Docker Compose, Helm, Kubernetes, etc.

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 4:05 PM: --

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 4:04 PM: --

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 4:03 PM: --

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 4:03 PM: --

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato commented on TIKA-3364: Oh my god! I'm feeling stupid. Anyway, I was not

[jira] [Created] (TIKA-3365) RTFParser to XMLContentHandler incorrectly interprets en dash.

2021-04-23 Thread Gordon Allen (Jira)
Gordon Allen created TIKA-3365: -- Summary: RTFParser to XMLContentHandler incorrectly interprets en dash. Key: TIKA-3365 URL: https://issues.apache.org/jira/browse/TIKA-3365 Project: Tika Issue

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330851#comment-17330851 ] Tim Allison commented on TIKA-3364: --- try {{pdfParser.setExtractBookmarksText(false);}}

CFP for ApacheCon 2021 closes in ONE WEEK

2021-04-23 Thread Rich Bowen
[You are receiving this because you're subscribed to one or more dev@ mailing lists for an Apache project, or the ApacheCon Announce list.] Time is running out to submit your talk for ApacheCon 2021. The Call for Presentations for ApacheCon @Home 2021, focused on Europe and North America time zo

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330827#comment-17330827 ] Nick Burch commented on TIKA-3364: -- I'm not sure if we already have outlines/bookmarks el

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 2:39 PM: --

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ] David Pilato commented on TIKA-3364: So I tried this: {code:java} PDFPars

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 2:38 PM: --

[jira] [Commented] (TIKA-3324) Add checkstyle checker

2021-04-23 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330823#comment-17330823 ] Hudson commented on TIKA-3324: -- FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #2

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330810#comment-17330810 ] Tim Allison commented on TIKA-3364: --- We should probably add extra markup in the xhtml to

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330809#comment-17330809 ] Tim Allison commented on TIKA-3364: --- You can see the text under the {{Outlines}} node.

[jira] [Updated] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3364: -- Attachment: Screenshot from 2021-04-23 10-15-22.png > PDF Content is extracted twice > -

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330805#comment-17330805 ] Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:13 PM: -

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330805#comment-17330805 ] Tim Allison commented on TIKA-3364: --- {noformat} Dummy PDF file {noformat} > PDF C

[jira] [Updated] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3364: -- Attachment: tika-bookmarks-config.xml > PDF Content is extracted twice > --

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799 ] Tim Allison commented on TIKA-3364: --- The PDF contains bookmark text, which is what is tr

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799 ] Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:08 PM: -

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799 ] Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:08 PM: -

[jira] [Created] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)
David Pilato created TIKA-3364: -- Summary: PDF Content is extracted twice Key: TIKA-3364 URL: https://issues.apache.org/jira/browse/TIKA-3364 Project: Tika Issue Type: Bug Components: p