[jira] [Closed] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-3149.
---

> Tikka 1.18 not working with tess4j 3.4.8 on linux
> ---
>
>                 Key: TIKA-3149
>                 URL: https://issues.apache.org/jira/browse/TIKA-3149
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: linux and deployed on weblogic
>            Reporter: Vishakha
>            Assignee: Konstantin Gribov
>            Priority: Blocker
>              Labels: starter
>
> I am using Tika 1.18 to parse document content. It works on its own when
> deployed on Linux, but it does not work if Tesseract is used before it.
> It gives the error below when calling parseToString.
> code :
> Tika tika = new Tika();
> try (InputStream stream = new FileInputStream(
>         Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString())) {
>     String documentExt = tika.detect(
>             Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString());
>     String outputStr = tika.parseToString(stream);
>     String tempStr = outputStr.replace("\n", "");
>     _Logger.info("tempStr: " + tempStr);
> } catch (TikaException e) {
>     // TODO Auto-generated catch block
>     _Logger.error("Error :", e);
> }
> Error as :
> java.lang.StackOverflowError
>  at org.slf4j.impl.JDK14LoggerAdapter.fillCallerData(JDK14LoggerAdapter.java:602)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:587)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>  at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>  at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>  at java.util.logging.Logger.log(Logger.java:738)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>  at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>  at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>  at java.util.logging.Logger.log(Logger.java:738)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>  at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>  at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>  at java.util.logging.Logger.log(Logger.java:738)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>  at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>  at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>  at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>  at java.util.logging.Logger.log(Logger.java:738)
>  ...
>
> Kindly let us know the solution.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-3149.
---
      Assignee: Konstantin Gribov
    Resolution: Not A Bug

> Tikka 1.18 not working with tess4j 3.4.8 on linux
> ---
>
>                 Key: TIKA-3149
>                 URL: https://issues.apache.org/jira/browse/TIKA-3149
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: linux and deployed on weblogic
>            Reporter: Vishakha
>            Assignee: Konstantin Gribov
>            Priority: Blocker
>              Labels: starter
[jira] [Commented] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux
[ https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331122#comment-17331122 ]

Konstantin Gribov commented on TIKA-3149:
---

You have both slf4j-jdk14 (a logger implementation backed by java.util.logging) and jul-to-slf4j (a bridge that redirects java.util.logging to slf4j-api) on the classpath, so each log call is passed back and forth between the two until the stack overflows.

I recommend dropping slf4j-jdk14 from the classpath and using any other logging implementation (logback-classic, log4j2).

> Tikka 1.18 not working with tess4j 3.4.8 on linux
> ---
>
>                 Key: TIKA-3149
>                 URL: https://issues.apache.org/jira/browse/TIKA-3149
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: linux and deployed on weblogic
>            Reporter: Vishakha
>            Priority: Blocker
>              Labels: starter
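A quick way to check for the combination described in the comment on TIKA-3149 is to probe the classpath for the two classes that alternate in the stack trace. The following is only a sketch: the class name `LoggingLoopCheck` and its methods are hypothetical, and it assumes the check runs under the same class loader as the application (on an application server such as WebLogic, the result may differ per deployment). Finding both classes is a hint, not proof, since the loop only occurs once the bridge handler is actually installed.

```java
/**
 * Sketch (hypothetical helper): probes the classpath for the two classes
 * that alternate in the TIKA-3149 stack trace.
 */
final class LoggingLoopCheck {

    // Class from slf4j-jdk14: sends SLF4J calls to java.util.logging.
    static final String JDK14_ADAPTER = "org.slf4j.impl.JDK14LoggerAdapter";
    // Class from jul-to-slf4j: sends java.util.logging records to SLF4J.
    static final String JUL_BRIDGE = "org.slf4j.bridge.SLF4JBridgeHandler";

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, LoggingLoopCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Both present means each side is able to forward log calls to the other. */
    static boolean hasLoopingCombination() {
        return isOnClasspath(JDK14_ADAPTER) && isOnClasspath(JUL_BRIDGE);
    }

    public static void main(String[] args) {
        if (hasLoopingCombination()) {
            System.err.println("Both slf4j-jdk14 and jul-to-slf4j are on the classpath; "
                    + "consider removing slf4j-jdk14 and using another SLF4J implementation.");
        } else {
            System.out.println("slf4j-jdk14 + jul-to-slf4j combination not detected.");
        }
    }
}
```

Running `main` inside the affected deployment would show whether both artifacts are visible there; the fix itself remains the one given in the comment, namely removing slf4j-jdk14 from the classpath.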
[jira] [Updated] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test
[ https://issues.apache.org/jira/browse/TIKA-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov updated TIKA-3369:
---
    Description:
Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with
{noformat}
[ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in:
http://www.w3.org/1999/xhtml";>
Multipage TIFF Example Page 1
Multipage TIFF Example Page?2
{noformat}
Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}.

  was:
Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with
{noformat}
[ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in:
http://www.w3.org/1999/xhtml";>
Multipage TIFF Example Page 1
Multipage TIFF Example Page?2
{noformat}

> Flaky Tesseract OCR confirmMultiPageTiffHandling test
> ---
>
>                 Key: TIKA-3369
>                 URL: https://issues.apache.org/jira/browse/TIKA-3369
>             Project: Tika
>          Issue Type: Test
>          Components: ocr
>    Affects Versions: 2.0.0
>         Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, 21 Apr 2021 17:22:13 + x86_64 GNU/Linux
> OpenJDK 15.0.2.u7-1
> Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed)
>            Reporter: Konstantin Gribov
>            Priority: Minor
>
> Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with
> {noformat}
> [ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79
> Page 2 not found in:
> http://www.w3.org/1999/xhtml";>
> />
> content="org.apache.tika.parser.ocr.TesseractOCRParser" />
> Multipage TIFF Example Page 1
> Multipage TIFF Example Page?2
> {noformat}
> Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}.
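One possible reading of the `Page?2` output in TIKA-3369 is that Tesseract emitted a non-ASCII space (for example U+00A0) between the two tokens, which then printed as `?`. That is an assumption, not something the report confirms. Under that assumption, a test helper could collapse every Unicode space into a plain space before comparing. The class and method names below are hypothetical and are not part of Tika's `TikaTest`.

```java
import java.util.regex.Pattern;

/** Sketch (hypothetical helper): whitespace-insensitive containment check for OCR text. */
final class OcrTextNormalizer {

    // Any run of ASCII whitespace or Unicode separators (\p{Z} includes U+00A0, U+2009, ...).
    private static final Pattern SPACES = Pattern.compile("[\\s\\p{Z}]+");

    /** Replaces each run of space-like characters with a single ASCII space. */
    static String normalizeSpaces(String text) {
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    /** True if needle occurs in haystack once both are space-normalized. */
    static boolean containsNormalized(String needle, String haystack) {
        return normalizeSpaces(haystack).contains(normalizeSpaces(needle));
    }

    public static void main(String[] args) {
        // Simulated OCR output with a no-break space between "Page" and "2".
        String ocr = "Multipage TIFF Example Page\u00A02";
        System.out.println(ocr.contains("Page 2"));             // false
        System.out.println(containsNormalized("Page 2", ocr));  // true
    }
}
```

If the extracted character is a literal question mark rather than an exotic space, this normalization would not help, and the assertion would still fail.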
[jira] [Created] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test
Konstantin Gribov created TIKA-3369:
---

             Summary: Flaky Tesseract OCR confirmMultiPageTiffHandling test
                 Key: TIKA-3369
                 URL: https://issues.apache.org/jira/browse/TIKA-3369
             Project: Tika
          Issue Type: Test
          Components: ocr
    Affects Versions: 2.0.0
         Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, 21 Apr 2021 17:22:13 + x86_64 GNU/Linux
OpenJDK 15.0.2.u7-1
Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed)
            Reporter: Konstantin Gribov

Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with
{noformat}
[ERROR] TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79 Page 2 not found in:
http://www.w3.org/1999/xhtml";>
Multipage TIFF Example Page 1
Multipage TIFF Example Page?2
{noformat}
[RFC] Tika BOMs/platforms
Hi, folks.

I'm hoping for comments and a kind of lazy consensus. If there are no objections, I'll merge this to main and branch_1x.

I created tika-bom modules that provide a bill of materials (in Apache Maven terminology) / platform (for Gradle users). They make it easy to keep Tika module versions aligned: you write the Tika version once, when importing the BOM. The downside is that whenever a new downstream-consumable module is added (another parser module, for example), we have to remember to add it to tika-bom.

There are BOMs for both the 2.x [1, 2] and 1.x [3, 4] branches.

For 1.x, tika-bom includes almost all artifacts useful as dependencies in downstream projects, so Tika end-user applications like tika-server and tika-app shouldn't be included. For 2.x it is mostly the same (but with many more modules xD). It also includes artifacts like eval-core, server-core, etc. Tika Pipes are in a separate BOM (the tika-pipes pom module itself).

[1]: https://issues.apache.org/jira/browse/TIKA-3367
[2]: https://github.com/apache/tika/pull/431
[3]: https://issues.apache.org/jira/browse/TIKA-3368
[4]: https://github.com/apache/tika/pull/432

--
Best regards,
Konstantin Gribov.
[jira] [Commented] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)
[ https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331107#comment-17331107 ] ASF GitHub Bot commented on TIKA-3368: -- grossws opened a new pull request #432: URL: https://github.com/apache/tika/pull/432 Fixes #TIKA-3368 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Bill of Materials (BOM) artifact (Tika 1.x) > --- > > Key: TIKA-3368 > URL: https://issues.apache.org/jira/browse/TIKA-3368 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 1.27 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika] grossws opened a new pull request #432: [TIKA-3368] Add tika-bom module
grossws opened a new pull request #432: URL: https://github.com/apache/tika/pull/432 Fixes #TIKA-3368 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331102#comment-17331102 ] ASF GitHub Bot commented on TIKA-3367: -- grossws opened a new pull request #431: URL: https://github.com/apache/tika/pull/431 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [tika] grossws opened a new pull request #431: [TIKA-3367] Add Bill of Materials (BOM)
grossws opened a new pull request #431: URL: https://github.com/apache/tika/pull/431 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)
Konstantin Gribov created TIKA-3368:
---

             Summary: Add Bill of Materials (BOM) artifact (Tika 1.x)
                 Key: TIKA-3368
                 URL: https://issues.apache.org/jira/browse/TIKA-3368
             Project: Tika
          Issue Type: Improvement
          Components: packaging
            Reporter: Konstantin Gribov
            Assignee: Konstantin Gribov
             Fix For: 1.27
[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov updated TIKA-3367: Fix Version/s: (was: 1.27) > Add Bill of Materials (BOM) artifact > > > Key: TIKA-3367 > URL: https://issues.apache.org/jira/browse/TIKA-3367 > Project: Tika > Issue Type: Improvement > Components: packaging >Reporter: Konstantin Gribov >Assignee: Konstantin Gribov >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3367) Add Bill of Materials (BOM) artifact
Konstantin Gribov created TIKA-3367:
---

             Summary: Add Bill of Materials (BOM) artifact
                 Key: TIKA-3367
                 URL: https://issues.apache.org/jira/browse/TIKA-3367
             Project: Tika
          Issue Type: Improvement
          Components: packaging
            Reporter: Konstantin Gribov
            Assignee: Konstantin Gribov
             Fix For: 2.0.0, 1.27
[jira] [Resolved] (TIKA-3363) Have tika-docker artifacts start in spawn mode (configurable)
[ https://issues.apache.org/jira/browse/TIKA-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved TIKA-3363.
---
    Fix Version/s:     (was: 1.27)
       Resolution: Won't Fix

> Have tika-docker artifacts start in spawn mode (configurable)
> ---
>
>                 Key: TIKA-3363
>                 URL: https://issues.apache.org/jira/browse/TIKA-3363
>             Project: Tika
>          Issue Type: Improvement
>          Components: docker, tika-docker
>    Affects Versions: 1.26
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>
> I would like to poll tika-docker users as to whether we should turn on the
> *-spawnChild* flag by default, while keeping it configurable (build and
> deploy overrides).
> See
> https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MakingTikaServerRobusttoOOMs,InfiniteLoopsandMemoryLeaks
> for more documentation on this topic.
> Right now it is impossible to configure this in tika-helm unless it is
> configurable from the tika-docker artifact. Thank you
> [~dmeikle] FYI
[jira] [Closed] (TIKA-3363) Have tika-docker artifacts start in spawn mode (configurable)
[ https://issues.apache.org/jira/browse/TIKA-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed TIKA-3363.
---

> Have tika-docker artifacts start in spawn mode (configurable)
> ---
>
>                 Key: TIKA-3363
>                 URL: https://issues.apache.org/jira/browse/TIKA-3363
>             Project: Tika
>          Issue Type: Improvement
>          Components: docker, tika-docker
>    Affects Versions: 1.26
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
[jira] [Created] (TIKA-3366) Retrospective release of tika-docker 2.0.0-ALPHA
Lewis John McGibbney created TIKA-3366:
---

             Summary: Retrospective release of tika-docker 2.0.0-ALPHA
                 Key: TIKA-3366
                 URL: https://issues.apache.org/jira/browse/TIKA-3366
             Project: Tika
          Issue Type: Improvement
          Components: docker
    Affects Versions: 2.0.0-ALPHA
            Reporter: Lewis John McGibbney
             Fix For: 2.0.0-ALPHA

I recently created TIKA-3363 with the goal of making spawnChild mode configurable from within the docker image.

I just discovered that tika 2.0.0 (main) introduces [tika-server-config-default.xml|https://github.com/apache/tika/blob/main/tika-server/tika-server-core/src/main/resources/tika-server-config-default.xml#L54-L63], which implements the equivalent of spawnChild mode by default.

Additionally, those who wish to configure this further can mount a custom configuration as a volume and pass the [command line parameter to load it|https://github.com/apache/tika-docker#custom-config].

This ticket therefore supersedes TIKA-3363.

Ultimately we will need to wait for the release of 2.0.0 proper before this configuration becomes active by default.
[INVITATION] Apache Tika container orchestration meetup
Hi Folks,

If you are interested in participating in a mini meetup based around Apache Tika container orchestration then please indicate your preferred availability at the Doodle Poll below.

This community meetup focuses on Tika container orchestration (Docker, Docker Compose, Helm, Kubernetes, etc.). Anyone is invited to join :)

https://doodle.com/poll/zf3kfhmn5b7626kk?utm_source=poll&utm_medium=link

Thank you
lewismc

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ]

David Pilato edited comment on TIKA-3364 at 4/23/21, 4:05 PM:
---

Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used
{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}
It works. But I'm still seeing the double text (instead of triple), when OCR is on.

was (Author: dadoonet):
Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used
{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}
It works. I'm still seeing the double text (instead of triple), when OCR is on.

> PDF Content is extracted twice
> ---
>
>                 Key: TIKA-3364
>                 URL: https://issues.apache.org/jira/browse/TIKA-3364
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.26
>            Reporter: David Pilato
>            Priority: Major
>         Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, tika-bookmarks-config.xml
>
> Hi
> Coming from [this issue in FSCrawler project|https://github.com/dadoonet/fscrawler/issues/1097],
> I can see that the text from the PDF document is extracted more than once
> although PDFBox seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> xmlns="http://www.w3.org/1999/xhtml";>
> content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> Dummy PDF file
> Dummy PDF file
> {code}
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ]

David Pilato edited comment on TIKA-3364 at 4/23/21, 4:04 PM:
---

Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used
{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}
It works. I'm still seeing the double text (instead of triple), when OCR is on.

was (Author: dadoonet):
Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used
{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}

> PDF Content is extracted twice
> ---
>
>                 Key: TIKA-3364
>                 URL: https://issues.apache.org/jira/browse/TIKA-3364
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 4:03 PM: -- Oh my god! I'm feeling stupid. Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505]. But I used {code:java} pdfParser.getPDFParserConfig().setExtractBookmarksText(false); {code} was (Author: dadoonet): Oh my god! I'm feeling stupid. Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505]. > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. 
> When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 4:03 PM: -- Oh my god! I'm feeling stupid. Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505]. But I used {code:java} pdfParser.getPDFParserConfig().setExtractBookmarksText(false); {code} was (Author: dadoonet): Oh my god! I'm feeling stupid. Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505]. But I used {code:java} pdfParser.getPDFParserConfig().setExtractBookmarksText(false); {code} > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. 
> When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882 ] David Pilato commented on TIKA-3364: Oh my god! I'm feeling stupid. Anyway, I was not able to choose this method as [it's not a public one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505]. > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3365) RTFParser to XMLContentHandler incorrectly interprets en dash.
Gordon Allen created TIKA-3365: -- Summary: RTFParser to XMLContentHandler incorrectly interprets en dash. Key: TIKA-3365 URL: https://issues.apache.org/jira/browse/TIKA-3365 Project: Tika Issue Type: Bug Components: handler, parser Affects Versions: 1.26 Environment: macOS Catalina 10.15.7 Java version "15" 2020-09-15 Eclipse 2020-12 Reporter: Gordon Allen If the RTF document contains an en-dash "\endash" the resultant HTML output from the handler is "¿¿¿" instead of "–" Not sure if the issue is in the Parser or Handler. -- This message was sent by Atlassian Jira (v8.3.4#803005)
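One thing that may be worth checking when triaging this report (a hypothesis only, the report does not confirm it): the en dash U+2013 occupies three bytes in UTF-8, so three substitute characters in the output would be consistent with each byte being treated as its own character somewhere between the parser and the handler. A minimal, self-contained Java sketch that shows the byte count and what a single-byte decode does to those bytes; the class name `EnDashBytes` and its helpers are hypothetical and are not part of Tika:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; it does not use or reproduce Tika code.
final class EnDashBytes {
    // The en dash character, U+2013.
    static final String EN_DASH = "\u2013";

    // UTF-8 encoding of the en dash.
    static byte[] utf8Bytes() {
        return EN_DASH.getBytes(StandardCharsets.UTF_8);
    }

    // The same bytes decoded with a single-byte charset: one char per byte.
    static String decodedAsLatin1() {
        return new String(utf8Bytes(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Print each UTF-8 byte in hex.
        for (byte b : utf8Bytes()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        // Number of characters after the mismatched decode.
        System.out.println(decodedAsLatin1().length());
    }
}
```

If the character count after the mismatched decode matches the number of substitute characters seen in the handler output, that would point at a charset mismatch rather than at the RTF control-word handling itself; if it does not, this hypothesis can be dropped.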
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330851#comment-17330851 ] Tim Allison commented on TIKA-3364: --- try {{pdfParser.setExtractBookmarksText(false);}} > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
CFP for ApacheCon 2021 closes in ONE WEEK
[You are receiving this because you're subscribed to one or more dev@ mailing lists for an Apache project, or the ApacheCon Announce list.] Time is running out to submit your talk for ApacheCon 2021. The Call for Presentations for ApacheCon @Home 2021, focused on Europe and North America time zones, closes May 3rd, and is at https://www.apachecon.com/acah2021/cfp.html The CFP for ApacheCon Asia, focused on Asia/Pacific time zones, is at https://apachecon.com/acasia2021/cfp.html and also closes on May 3rd. ApacheCon is our main event, featuring content from any and all of our projects, and is your best opportunity to get your project in front of the largest audience of enthusiasts. Please don't wait for the last minute. Get your talks in today! -- Rich Bowen, VP Conferences The Apache Software Foundation https://apachecon.com/ @apachecon
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330827#comment-17330827 ] Nick Burch commented on TIKA-3364: -- I'm not sure if we already have outlines/bookmarks elsewhere in other parsers, to copy the suggested markup. But some sort of annotation on the xhtml makes sense to me! > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 2:39 PM: -- So I tried this: {code:java} PDFParser pdfParser = new PDFParser(); DefaultParser defaultParser; pdfParser.setExtractAnnotationText(false); if (!fs.getOcr().isEnabled()) { logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly"); defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(TesseractOCRParser.class)); } else { logger.debug("OCR is activated."); if (ExternalParser.check("tesseract")) { logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy()); pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy()); } else { logger.debug("But Tesseract is not installed so we won't run OCR."); pdfParser.setOcrStrategy("no_ocr"); } defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(PDFParser.class)); } parser = new AutoDetectParser(defaultParser, pdfParser); {code} And it seems to be producing the same effect. I'm probably missing something. When I run it with this configuration, the extracted text is actually: {code:none} \nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} So the text is extracted 3 times. When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting: {code:none} \nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} was (Author: dadoonet): So I tried this: {code:java} PDFParser pdfParser = new PDFParser(); DefaultParser defaultParser; pdfParser.setExtractAnnotationText(false); if (!fs.getOcr().isEnabled()) { logger.debug("OCR is disabled. 
Even though it's detected, it must be disabled explicitly"); defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(TesseractOCRParser.class)); } else { logger.debug("OCR is activated."); if (ExternalParser.check("tesseract")) { logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy()); pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy()); } else { logger.debug("But Tesseract is not installed so we won't run OCR."); pdfParser.setOcrStrategy("no_ocr"); } defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(PDFParser.class)); } parser = new AutoDetectParser(defaultParser, pdfParser); {code} And it seems to be producing the same effect. I'm probably missing something. When I run it with this configuration, the extracted text is actually: {code:txt} \nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} So the text is extracted 3 times. When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting: {code:txt} \nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. 
> When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ] David Pilato commented on TIKA-3364: So I tried this:
{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;
pdfParser.setExtractAnnotationText(false);
if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}
parser = new AutoDetectParser(defaultParser, pdfParser);
{code}
And it seems to be producing the same effect. I'm probably missing something. When I run it with this configuration, the extracted text is actually:
{code:txt}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}
So the text is extracted 3 times.
When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting: {code:txt} \nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
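The counts reported in the comment above (three copies of the text with OCR on, two with `no_ocr`) can be checked mechanically against the two quoted output strings. A minimal sketch in plain Java, assuming the strings are exactly as quoted in the comment; the class `ExtractionCount` and the method `countOccurrences` are hypothetical names used only for this illustration and are not part of Tika or FSCrawler:

```java
// Hypothetical helper for illustration only; it works on the strings quoted in the comment.
final class ExtractionCount {

    // Counts non-overlapping occurrences of a non-empty needle in text.
    static int countOccurrences(String text, String needle) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Extracted text as quoted in the comment, with OCR enabled.
        String withOcr = "\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n";
        // Extracted text as quoted in the comment, with the "no_ocr" strategy.
        String noOcr = "\nDummy PDF file\n\n\n\tDummy PDF file\n\n";

        System.out.println(countOccurrences(withOcr, "Dummy PDF file"));
        System.out.println(countOccurrences(noOcr, "Dummy PDF file"));
    }
}
```

A check like this could serve as a regression assertion in a test around the extraction call, so that a configuration change that removes the duplicated text shows up as a count of one.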
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ] David Pilato edited comment on TIKA-3364 at 4/23/21, 2:38 PM: -- So I tried this: {code:java} PDFParser pdfParser = new PDFParser(); DefaultParser defaultParser; pdfParser.setExtractAnnotationText(false); if (!fs.getOcr().isEnabled()) { logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly"); defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(TesseractOCRParser.class)); } else { logger.debug("OCR is activated."); if (ExternalParser.check("tesseract")) { logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy()); pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy()); } else { logger.debug("But Tesseract is not installed so we won't run OCR."); pdfParser.setOcrStrategy("no_ocr"); } defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(PDFParser.class)); } parser = new AutoDetectParser(defaultParser, pdfParser); {code} And it seems to be producing the same effect. I'm probably missing something. When I run it with this configuration, the extracted text is actually: {code:txt} \nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} So the text is extracted 3 times. When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting: {code:txt} \nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} was (Author: dadoonet): So I trie this: {code:java} PDFParser pdfParser = new PDFParser(); DefaultParser defaultParser; pdfParser.setExtractAnnotationText(false); if (!fs.getOcr().isEnabled()) { logger.debug("OCR is disabled. 
Even though it's detected, it must be disabled explicitly"); defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(TesseractOCRParser.class)); } else { logger.debug("OCR is activated."); if (ExternalParser.check("tesseract")) { logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy()); pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy()); } else { logger.debug("But Tesseract is not installed so we won't run OCR."); pdfParser.setOcrStrategy("no_ocr"); } defaultParser = new DefaultParser( MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(), Collections.singletonList(PDFParser.class)); } parser = new AutoDetectParser(defaultParser, pdfParser); {code} And it seems to be producing the same effect. I'm probably missing something. When I run it with this configuration, the extracted text is actually: {code:txt} \nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} So the text is extracted 3 times. When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting: {code:txt} \nDummy PDF file\n\n\n\tDummy PDF file\n\n {code} > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. 
> When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://dow
[jira] [Commented] (TIKA-3324) Add checkstyle checker
[ https://issues.apache.org/jira/browse/TIKA-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330823#comment-17330823 ] Hudson commented on TIKA-3324: -- FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #205 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/205/]) TIKA-3324 -- update pom files to index 2 spaces (we had some differences); (tallison: [https://github.com/apache/tika/commit/f53c527552da57bf2936eff2ae6c326a59bf095d]) * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java * (edit) tika-batch/pom.xml * (edit) tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/TokenCountPriorityQueue.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/batch/FileProfilerBuilder.java * (edit) tika-example/src/main/java/org/apache/tika/example/MetadataAwareLuceneIndexer.java * (edit) tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/util/ContentTagParser.java * (edit) tika-eval/tika-eval-core/src/test/java/org/apache/tika/eval/core/tokens/TokenCounterTest.java * (edit) tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/CJKBigramAwareLengthFilterFactory.java * (edit) tika-parsers/tika-parsers-extended/pom.xml * (edit) tika-server/tika-server-client/pom.xml * (edit) tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-font-module/pom.xml * (edit) tika-translate/pom.xml * (edit) tika-parsers/pom.xml * (edit) tika-pipes/tika-emitters/tika-emitter-s3/pom.xml * (edit) tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java * (edit) tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonStreamingSerializer.java * (edit) tika-example/src/main/java/org/apache/tika/example/DisplayMetInstance.java * (edit) 
tika-parsers/tika-parsers-advanced/tika-age-recogniser/pom.xml * (edit) tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java * (edit) tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java * (edit) tika-eval/tika-eval-app/pom.xml * (edit) tika-eval/tika-eval-app/src/test/java/org/apache/tika/eval/app/AnalyzerManagerTest.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/FileProfiler.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/tools/TopCommonTokenCounter.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/XLSXNumFormatter.java * (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml * (edit) tika-parsers/tika-parsers-advanced/tika-dl/pom.xml * (edit) tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java * (edit) tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionDetector.java * (edit) tika-eval/tika-eval-core/src/test/java/org/apache/tika/eval/core/langid/LangIdTest.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/tools/SlowCompositeReaderWrapper.java * (edit) tika-batch/src/test/java/org/apache/tika/batch/fs/BatchProcessTest.java * (edit) tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java * (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/pom.xml * (edit) tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/metadata/TikaEvalMetadataFilter.java * (edit) tika-langdetect/tika-langdetect-lingo24/src/main/java/org/apache/tika/langdetect/lingo24/Lingo24LangDetector.java * (edit) tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/CommonTokenCountManager.java * (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/FSOutputStreamFactory.java * (edit) tika-pipes/tika-fetch-iterators/tika-fetch-iterator-s3/pom.xml * (edit) 
tika-langdetect/tika-langdetect-opennlp/pom.xml * (edit) tika-example/src/main/java/org/apache/tika/example/LanguageDetectorExample.java * (delete) tika-batch/src/test/resources/log4j_process.properties * (edit) tika-example/src/main/java/org/apache/tika/example/DumpTikaConfigExample.java * (edit) tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/io/ExtractReaderException.java * (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-text-module/pom.xml * (add) tika-langdetect/tika-langdetect-mitll-text/src/test/resources/log4j2.properties * (edit) tika-pipes/tika-fetch-iterators/tika-fetch-iterator-jdbc/pom.xml * (edit) tika-example/src/test/java/org/apache/tika/example/ContentHandlerExampleTest.java * (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-miscoffice-module/pom.xml * (edit) tika-eval/tika-eval-app/src/test/java/org/apache/tika/eval/app/db/AbstractBufferTest.java * (edit) tika-eval/tik
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330810#comment-17330810 ] Tim Allison commented on TIKA-3364: --- We should probably add extra markup in the xhtml to identify the outlines/bookmarks ... > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330809#comment-17330809 ] Tim Allison commented on TIKA-3364: --- You can see the text under the {{Outlines}} node. > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3364: -- Attachment: Screenshot from 2021-04-23 10-15-22.png > PDF Content is extracted twice > -- > > Key: TIKA-3364 > URL: https://issues.apache.org/jira/browse/TIKA-3364 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.26 >Reporter: David Pilato >Priority: Major > Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, > tika-bookmarks-config.xml > > > Hi > Coming from [this issue in FSCrawler > project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that > the text from the PDF document is extracted more than once although PDFBox > seems to extract it only once. > I attached the PDF. > When I run: > {code:sh} > wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar > java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf > {code} > I'm getting: > {code:sh} > Dummy PDF file > {code} > But with Tika: > {code:sh} > wget https://downloads.apache.org/tika/tika-app-1.26.jar > java -jar tika-app-1.26.jar > {code} > I'm getting: > {code:xml} > xmlns="http://www.w3.org/1999/xhtml";> > > > > > > > > > > > > > > > > > > > > > > > > content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/> > > > > > > > > > > > > > > > > > Dummy PDF file > > > Dummy PDF file > > > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330805#comment-17330805 ]

Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:13 PM:
------------------------------------------------------------

With the attached config file, I get this:
{noformat}
Dummy PDF file
{noformat}

was (Author: talli...@mitre.org):
{noformat}
Dummy PDF file
{noformat}
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330805#comment-17330805 ]

Tim Allison commented on TIKA-3364:
-----------------------------------

{noformat}
Dummy PDF file
{noformat}
[jira] [Updated] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3364:
------------------------------
    Attachment: tika-bookmarks-config.xml
[jira] [Commented] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799 ]

Tim Allison commented on TIKA-3364:
-----------------------------------

The PDF contains bookmark text, which is what is triggering the . You can configure Tika not to extract bookmark text with "extractBookmarksText" with something like:
{noformat}
false
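The markup of the configuration quoted in the comment above did not survive in this archive; only the parameter name "extractBookmarksText" and the value "false" remain. What follows is a minimal sketch of what such a tika-config.xml could look like, assuming the usual Tika 1.x parser-configuration layout. It is not the attached tika-bookmarks-config.xml; the element structure and the class names are assumptions taken from Tika's general configuration conventions rather than from this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: structure assumed from Tika 1.x config conventions,
     not copied from the attachment on TIKA-3364. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but take PDFParser out of the default set
         so the explicitly configured instance below is the one used for PDFs. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Configure PDFParser not to emit bookmark (outline) text. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractBookmarksText" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With tika-app, a file like this would typically be passed via the --config option, for example: java -jar tika-app-1.26.jar --config=tika-bookmarks-config.xml issue-1097.pdf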
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799 ]

Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:08 PM:
------------------------------------------------------------

The PDF contains bookmark text, which is what is triggering the . You can configure Tika not to extract bookmark text with "extractBookmarksText" with something like:
{noformat}
false
{noformat}

was (Author: talli...@mitre.org):
The PDF contains bookmark text, which is what is triggering the . You can configure Tika not to extract bookmark text with "extractBookmarksText" with something like:
{noformat}
false
{noformat}
[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799 ]

Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:08 PM:
------------------------------------------------------------

The PDF contains bookmark text, which is what is triggering the . You can configure Tika not to extract bookmark text with "extractBookmarksText" with something like:
{noformat}
false
{noformat}

was (Author: talli...@mitre.org):
The PDF contains bookmark text, which is what is triggering the . You can configure Tika not to extract bookmark text with "extractBookmarksText" with something like:
{noformat}
false
[jira] [Created] (TIKA-3364) PDF Content is extracted twice
David Pilato created TIKA-3364:
----------------------------------

             Summary: PDF Content is extracted twice
                 Key: TIKA-3364
                 URL: https://issues.apache.org/jira/browse/TIKA-3364
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.26
            Reporter: David Pilato
         Attachments: issue-1097.pdf


Hi

Coming from [this issue in FSCrawler project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that the text from the PDF document is extracted more than once although PDFBox seems to extract it only once.

I attached the PDF.

When I run:
{code:sh}
wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
{code}

I'm getting:
{code:sh}
Dummy PDF file
{code}

But with Tika:
{code:sh}
wget https://downloads.apache.org/tika/tika-app-1.26.jar
java -jar tika-app-1.26.jar
{code}

I'm getting:
{code:xml}
http://www.w3.org/1999/xhtml">
Dummy PDF file
Dummy PDF file
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)