date:20210423

[jira] [Closed] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-3149.
---

> Tikka 1.18 not working with tess4j 3.4.8 on linux
> -
>
> Key: TIKA-3149
> URL: https://issues.apache.org/jira/browse/TIKA-3149
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.18
> Environment: linux and deployed on weblogic
>Reporter: Vishakha 
>Assignee: Konstantin Gribov
>Priority: Blocker
>  Labels: starter
>
> I am using Tika 1.18 to parse document content. It works on its own when 
> deployed on Linux, but it does not work if Tesseract (tess4j) is used before 
> it. It gives the error below during parseToString.
> Code:
> Tika tika = new Tika();
> String path = Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString();
> try (InputStream stream = new FileInputStream(path)) {
>     // Detect the type from the file name, then extract the text
>     String documentExt = tika.detect(path);
>     String outputStr = tika.parseToString(stream);
>     String tempStr = outputStr.replace("\n", "");
>     _Logger.info("tempStr: " + tempStr);
> } catch (TikaException e) {
>     // TODO Auto-generated catch block
>     _Logger.error("Error :", e);
> }
> Error:
> java.lang.StackOverflowError
>   at org.slf4j.impl.JDK14LoggerAdapter.fillCallerData(JDK14LoggerAdapter.java:602)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:587)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>   at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>   at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>   at java.util.logging.Logger.log(Logger.java:738)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>   at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>   at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>   at java.util.logging.Logger.log(Logger.java:738)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>   at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>   at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>   at java.util.logging.Logger.log(Logger.java:738)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
>   at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
>   at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
>   at java.util.logging.Logger.log(Logger.java:738)
> ...
>
> Kindly let us know the solution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-3149:

Description: 
I am using Tika 1.18 to parse document content. It works on its own when 
deployed on Linux, but it does not work if Tesseract (tess4j) is used before 
it. It gives the error below during parseToString.

Code:

Tika tika = new Tika();
String path = Paths.get(documentPath.concat(documentName)).toAbsolutePath().toString();
try (InputStream stream = new FileInputStream(path)) {
    // Detect the type from the file name, then extract the text
    String documentExt = tika.detect(path);
    String outputStr = tika.parseToString(stream);
    String tempStr = outputStr.replace("\n", "");
    _Logger.info("tempStr: " + tempStr);
} catch (TikaException e) {
    // TODO Auto-generated catch block
    _Logger.error("Error :", e);
}

Error:
java.lang.StackOverflowError
    at org.slf4j.impl.JDK14LoggerAdapter.fillCallerData(JDK14LoggerAdapter.java:602)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:587)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:588)
    at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:660)
    at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
    at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
    at java.util.logging.Logger.log(Logger.java:738)
...

Kindly let us know the solution.


[jira] [Resolved] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-3149.
-
  Assignee: Konstantin Gribov
Resolution: Not A Bug


[jira] [Commented] (TIKA-3149) Tikka 1.18 not working with tess4j 3.4.8 on linux

2021-04-23 Thread Konstantin Gribov (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331122#comment-17331122
 ] 

Konstantin Gribov commented on TIKA-3149:
-

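You have both slf4j-jdk14 (a logger implementation that writes to 
java.util.logging) and jul-to-slf4j (a bridge that redirects java.util.logging 
to slf4j-api) on the classpath. Together they pass every log record back and 
forth in an endless loop, which is the cycle repeated in your stack trace. I 
recommend dropping slf4j-jdk14 from the classpath and using any other logging 
implementation (logback-classic, log4j2).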
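
One quick way to confirm this diagnosis is to check whether both classes named in the stack trace are visible at runtime. The sketch below is only an illustration: the class LoggingLoopCheck and its methods are hypothetical names invented for this example, and it inspects only the class loader that loaded it, which on an application server such as WebLogic may not be the loader your deployment actually uses.

```java
final class LoggingLoopCheck {
    // Class names copied from the stack trace in this issue.
    static final String JDK14_BINDING = "org.slf4j.impl.JDK14LoggerAdapter";
    static final String JUL_BRIDGE = "org.slf4j.bridge.SLF4JBridgeHandler";

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, LoggingLoopCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** True when both halves of the logging loop are visible at the same time. */
    static boolean hasConflict() {
        return isPresent(JDK14_BINDING) && isPresent(JUL_BRIDGE);
    }

    public static void main(String[] args) {
        System.out.println("slf4j-jdk14 binding present: " + isPresent(JDK14_BINDING));
        System.out.println("jul-to-slf4j bridge present: " + isPresent(JUL_BRIDGE));
        System.out.println("conflict: " + hasConflict());
    }
}
```

If the last line reports a conflict, removing one of the two jars from the deployment should break the cycle.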


[jira] [Updated] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test

2021-04-23 Thread Konstantin Gribov (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-3369:

Description: 
Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with

{noformat}
[ERROR]   
TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79
 Page 2 not found in:
<html xmlns="http://www.w3.org/1999/xhtml">






Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page?2


{noformat}

Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}.
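
The report does not say which character is shown as {{?}}. If it turns out to be a non-ASCII space such as U+00A0 (an assumption, not something confirmed here), a test could normalize whitespace before comparing. The sketch below illustrates that idea only; OcrTextNormalizer and normalizeSpaces are hypothetical names, not part of Tika, and this is not presented as the fix chosen for this issue.

```java
final class OcrTextNormalizer {
    /** Collapses every run of Unicode separators and whitespace into one ASCII space. */
    static String normalizeSpaces(String text) {
        // \p{Z} matches Unicode separator characters, including U+00A0 (no-break space);
        // \s matches ASCII whitespace such as tab and newline.
        return text.replaceAll("[\\p{Z}\\s]+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical OCR output with a no-break space between "Page" and "2"
        String ocrOutput = "Multipage\nTIFF\nExample\nPage\u00A02";
        System.out.println(normalizeSpaces(ocrOutput).contains("Page 2")); // prints true
    }
}
```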


> Flaky Tesseract OCR confirmMultiPageTiffHandling test
> -
>
> Key: TIKA-3369
> URL: https://issues.apache.org/jira/browse/TIKA-3369
> Project: Tika
>  Issue Type: Test
>  Components: ocr
>Affects Versions: 2.0.0
> Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, 
> 21 Apr 2021 17:22:13 + x86_64 GNU/Linux
> OpenJDK 15.0.2.u7-1
> Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, 
> tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed)
>Reporter: Konstantin Gribov
>Priority: Minor
>
> Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with
> {noformat}
> [ERROR]   
> TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79
>  Page 2 not found in:
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> 
>  />
>  content="org.apache.tika.parser.ocr.TesseractOCRParser" />
> 
> 
> Multipage
> TIFF
> Example
> Page 1
> Multipage
> TIFF
> Example
> Page?2
> 
> 
> {noformat}
> Note that Tesseract extracts {{Page?2}} instead of {{Page 2}}.




[jira] [Created] (TIKA-3369) Flaky Tesseract OCR confirmMultiPageTiffHandling test

2021-04-23 Thread Konstantin Gribov (Jira)

Konstantin Gribov created TIKA-3369:
---

 Summary: Flaky Tesseract OCR confirmMultiPageTiffHandling test
 Key: TIKA-3369
 URL: https://issues.apache.org/jira/browse/TIKA-3369
 Project: Tika
  Issue Type: Test
  Components: ocr
Affects Versions: 2.0.0
 Environment: Arch Linux, kernel: 5.11.16-arch1-1 #1 SMP PREEMPT Wed, 
21 Apr 2021 17:22:13 + x86_64 GNU/Linux
OpenJDK 15.0.2.u7-1
Tesseract 4.1.1-5 with icu 69.1-1, cairo 1.17.4-5, pango 1:1.48.4-1, 
tesseract-data-{eng,deu,fra,rus,ukr} 2:4.0.0-1 (other languages not installed)

Reporter: Konstantin Gribov


Current main@08793d360a838db04a3d23b902c34d9e6e7362e4 fails with

{noformat}
[ERROR]   
TesseractOCRParserTest.confirmMultiPageTiffHandling:108->TikaTest.assertContains:79
 Page 2 not found in:
<html xmlns="http://www.w3.org/1999/xhtml">






Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page?2


{noformat}





[RFC] Tika BOMs/platforms

2021-04-23 Thread Konstantin Gribov

Hi, folks.

I'm hoping for comments and a kind of lazy consensus. If there are no
objections, I'll merge it to main and branch_1x.

I created tika-bom modules that provide a bill of materials (in Apache Maven
terminology) or platform (for Gradle users). This makes it easy to keep Tika
module versions aligned and to specify the Tika version only once, when
importing the BOM.

The downside is that anyone adding a new downstream-consumable module (such as
another parser module) must remember to add it to tika-bom.

There are BOMs for both the 2.x [1, 2] and 1.x [3, 4] branches.

For 1.x, tika-bom includes almost all artifacts that are useful as
dependencies in downstream projects, so Tika end-user applications like
tika-server and tika-app shouldn't be included.

For 2.x it is mostly the same (but with many more modules xD). It also
includes artifacts like eval-core, server-core, etc. Tika Pipes is in a
separate BOM (the tika-pipes pom module itself).

[1]: https://issues.apache.org/jira/browse/TIKA-3367
[2]: https://github.com/apache/tika/pull/431
[3]: https://issues.apache.org/jira/browse/TIKA-3368
[4]: https://github.com/apache/tika/pull/432

-- 
Best regards,
Konstantin Gribov.

[jira] [Commented] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-04-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331107#comment-17331107
 ] 

ASF GitHub Bot commented on TIKA-3368:
--

grossws opened a new pull request #432:
URL: https://github.com/apache/tika/pull/432


   Fixes #TIKA-3368


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Bill of Materials (BOM) artifact (Tika 1.x)
> ---
>
> Key: TIKA-3368
> URL: https://issues.apache.org/jira/browse/TIKA-3368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 1.27
>
>





[GitHub] [tika] grossws opened a new pull request #432: [TIKA-3368] Add tika-bom module

2021-04-23 Thread GitBox



grossws opened a new pull request #432:
URL: https://github.com/apache/tika/pull/432


   Fixes #TIKA-3368



[jira] [Commented] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331102#comment-17331102
 ] 

ASF GitHub Bot commented on TIKA-3367:
--

grossws opened a new pull request #431:
URL: https://github.com/apache/tika/pull/431


   




> Add Bill of Materials (BOM) artifact
> 
>
> Key: TIKA-3367
> URL: https://issues.apache.org/jira/browse/TIKA-3367
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0
>
>





[GitHub] [tika] grossws opened a new pull request #431: [TIKA-3367] Add Bill of Materials (BOM)

2021-04-23 Thread GitBox



grossws opened a new pull request #431:
URL: https://github.com/apache/tika/pull/431


   



[jira] [Created] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-04-23 Thread Konstantin Gribov (Jira)

Konstantin Gribov created TIKA-3368:
---

 Summary: Add Bill of Materials (BOM) artifact (Tika 1.x)
 Key: TIKA-3368
 URL: https://issues.apache.org/jira/browse/TIKA-3368
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Konstantin Gribov
Assignee: Konstantin Gribov
 Fix For: 1.27







[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread Konstantin Gribov (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-3367:

Fix Version/s: (was: 1.27)

> Add Bill of Materials (BOM) artifact
> 
>
> Key: TIKA-3367
> URL: https://issues.apache.org/jira/browse/TIKA-3367
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0
>
>





[jira] [Created] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-04-23 Thread Konstantin Gribov (Jira)

Konstantin Gribov created TIKA-3367:
---

 Summary: Add Bill of Materials (BOM) artifact
 Key: TIKA-3367
 URL: https://issues.apache.org/jira/browse/TIKA-3367
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Konstantin Gribov
Assignee: Konstantin Gribov
 Fix For: 2.0.0, 1.27







[jira] [Resolved] (TIKA-3363) Have tika-docker artifacts start in spawn mode (configurable)

2021-04-23 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-3363.

Fix Version/s: (was: 1.27)
   Resolution: Won't Fix

> Have tika-docker artifacts start in spawn mode (configurable)
> -
>
> Key: TIKA-3363
> URL: https://issues.apache.org/jira/browse/TIKA-3363
> Project: Tika
>  Issue Type: Improvement
>  Components: docker, tika-docker
>Affects Versions: 1.26
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> I would like to poll tika-docker users as to whether we should turn on by 
> default, but make configurable (build and deploy overrides) use of the 
> *-spawnChild* flag.
> See 
> https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MakingTikaServerRobusttoOOMs,InfiniteLoopsandMemoryLeaks
>  for more documentation on this topic.
> Right now it is impossible to configure this in tika-helm unless it is 
> configurable from the tika-docker artifact. Thank you
> [~dmeikle] FYI




[jira] [Closed] (TIKA-3363) Have tika-docker artifacts start in spawn mode (configurable)

2021-04-23 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed TIKA-3363.
--


[jira] [Created] (TIKA-3366) Retrospective release of tika-docker 2.0.0-ALPHA

2021-04-23 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created TIKA-3366:
--

 Summary: Retrospective release of tika-docker 2.0.0-ALPHA
 Key: TIKA-3366
 URL: https://issues.apache.org/jira/browse/TIKA-3366
 Project: Tika
  Issue Type: Improvement
  Components: docker
Affects Versions: 2.0.0-ALPHA
Reporter: Lewis John McGibbney
 Fix For: 2.0.0-ALPHA


I recently created TIKA-3363 with the goal of making spawnChild mode 
configurable from within the docker image.

I just discovered that tika 2.0.0 (main) introduces 
[tika-server-config-default.xml|https://github.com/apache/tika/blob/main/tika-server/tika-server-core/src/main/resources/tika-server-config-default.xml#L54-L63]
 which implements the equivalent of spawnChild mode by default.

Additionally, those who wish to configure this further can mount a custom 
configuration as a volume and pass the [command line parameter to 
load it|https://github.com/apache/tika-docker#custom-config].

This ticket therefore supersedes TIKA-3363. Ultimately we will need to wait for 
the release of 2.0.0 proper before this configuration is activated by 
default.






[INVITATION] Apache Tika container orchestration meetup

2021-04-23 Thread lewis john mcgibbney

Hi Folks,
If you are interested in participating in a mini meetup based around Apache
Tika container orchestration then please indicate your preferred
availability at the Doodle Poll below.
This community meetup focuses on Tika container orchestration (Docker,
Docker Compose, Helm, Kubernetes, etc.). Anyone is invited to join :)
https://doodle.com/poll/zf3kfhmn5b7626kk?utm_source=poll&utm_medium=link
Thank you
lewismc


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882
 ] 

David Pilato edited comment on TIKA-3364 at 4/23/21, 4:05 PM:
--

Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used

{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}

It works, but I'm still seeing double text (instead of triple) when OCR is 
on.


> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dummy PDF file
> 
> 
>   Dummy PDF file
> 
> 
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882
 ] 

David Pilato edited comment on TIKA-3364 at 4/23/21, 4:04 PM:
--

Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used

{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}

It works. I'm still seeing the text twice (instead of three times) when OCR is on.


was (Author: dadoonet):
Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used

{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}




> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882
 ] 

David Pilato edited comment on TIKA-3364 at 4/23/21, 4:03 PM:
--

Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used

{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);

{code}





was (Author: dadoonet):
Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882
 ] 

David Pilato edited comment on TIKA-3364 at 4/23/21, 4:03 PM:
--

Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used

{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);
{code}





was (Author: dadoonet):
Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].

But I used

{code:java}
pdfParser.getPDFParserConfig().setExtractBookmarksText(false);

{code}




> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330882#comment-17330882
 ] 

David Pilato commented on TIKA-3364:


Oh my god! I'm feeling stupid.

Anyway, I was not able to choose this method as [it's not a public 
one|https://github.com/apache/tika/blob/be8817e7cdc511668ee9dfa7c385a65168a0051a/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L502-L505].



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




[jira] [Created] (TIKA-3365) RTFParser to XMLContentHandler incorrectly interprets en dash.

2021-04-23 Thread Gordon Allen (Jira)

Gordon Allen created TIKA-3365:
--

 Summary: RTFParser to XMLContentHandler incorrectly interprets en 
dash.
 Key: TIKA-3365
 URL: https://issues.apache.org/jira/browse/TIKA-3365
 Project: Tika
  Issue Type: Bug
  Components: handler, parser
Affects Versions: 1.26
 Environment: macOS Catalina 10.15.7

Java version "15" 2020-09-15

Eclipse 2020-12
Reporter: Gordon Allen


If the RTF document contains an en dash ("\endash"), the resulting HTML output 
from the handler is "¿¿¿" instead of "–".

Not sure if the issue is in the Parser or Handler.
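
One way to narrow this down, offered only as a diagnostic sketch: an en dash (U+2013) takes three bytes in UTF-8, so three wrong characters in place of one dash may point to a charset mismatch somewhere between the parser, the handler, and whatever displays the output. That is an assumption, not something the report confirms. The class and method names below are hypothetical and use only the JDK, not Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EnDashEncodingCheck {
    // U+2013 EN DASH, the character the RTF control word \endash stands for.
    static final String EN_DASH = "\u2013";

    // Encodes text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    // Renders each char as hex so the result does not depend on console encoding.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ok = roundTrip(EN_DASH, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String broken = roundTrip(EN_DASH, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(hex(ok));     // prints "2013": one character survives
        System.out.println(hex(broken)); // prints "e2 80 93": three wrong characters
    }
}
```

If the handler output, dumped as hex in the same way, already contains something other than 2013, the problem is likely upstream of the display. If it contains 2013, the writer or console encoding would be the next thing to check.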




[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330851#comment-17330851
 ] 

Tim Allison commented on TIKA-3364:
---

try {{pdfParser.setExtractBookmarksText(false);}}

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




CFP for ApacheCon 2021 closes in ONE WEEK

2021-04-23 Thread Rich Bowen


[You are receiving this because you're subscribed to one or more dev@
mailing lists for an Apache project, or the ApacheCon Announce list.]

Time is running out to submit your talk for ApacheCon 2021.

The Call for Presentations for ApacheCon @Home 2021, focused on Europe
and North America time zones, closes May 3rd, and is at
https://www.apachecon.com/acah2021/cfp.html

The CFP for ApacheCon Asia, focused on Asia/Pacific time zones, is at
https://apachecon.com/acasia2021/cfp.html and also closes on May 3rd.

ApacheCon is our main event, featuring content from any and all of our
projects, and is your best opportunity to get your project in front of
the largest audience of enthusiasts.

Please don't wait for the last minute. Get your talks in today!

--
Rich Bowen, VP Conferences
The Apache Software Foundation
https://apachecon.com/
@apachecon

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330827#comment-17330827
 ] 

Nick Burch commented on TIKA-3364:
--

I'm not sure if we already have outlines/bookmarks in other parsers, whose 
markup we could copy here.

But some sort of annotation on the xhtml makes sense to me!

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824
 ] 

David Pilato edited comment on TIKA-3364 at 4/23/21, 2:39 PM:
--

So I tried this:


{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;

pdfParser.setExtractAnnotationText(false);

if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}

parser = new AutoDetectParser(defaultParser, pdfParser);
{code}

And it seems to be producing the same effect. I'm probably missing something.


When I run it with this configuration, the extracted text is actually:

{code:none}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

So the text is extracted 3 times.

When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting:

{code:none}
\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}
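
For anyone who wants to check the repetition count mechanically rather than by eye, here is a minimal sketch using only the JDK. The two strings are the outputs quoted above; the class and method names are hypothetical and not part of Tika or FSCrawler.

```java
public class DuplicateTextCheck {
    // Extracted text as reported above, with OCR on and with the "no_ocr" strategy.
    static final String WITH_OCR = "\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n";
    static final String NO_OCR = "\nDummy PDF file\n\n\n\tDummy PDF file\n\n";

    // Counts non-overlapping occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // avoid looping forever on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences(WITH_OCR, "Dummy PDF file")); // prints 3
        System.out.println(countOccurrences(NO_OCR, "Dummy PDF file"));   // prints 2
    }
}
```

A check like this could go into a regression test once the expected count for a given configuration is agreed on; the expected value itself is a decision the sketch does not make.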




was (Author: dadoonet):
So I tried this:


{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;

pdfParser.setExtractAnnotationText(false);

if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}

parser = new AutoDetectParser(defaultParser, pdfParser);
{code}

And it seems to be producing the same effect. I'm probably missing something.


When I run it with this configuration, the extracted text is actually:

{code:txt}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

So the text is extracted 3 times.

When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting:

{code:txt}
\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824
 ] 

David Pilato commented on TIKA-3364:


So I tried this:


{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;

pdfParser.setExtractAnnotationText(false);

if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}

parser = new AutoDetectParser(defaultParser, pdfParser);
{code}

And it seems to be producing the same effect. I'm probably missing something.


When I run it with this configuration, the extracted text is actually:

{code:txt}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

So the text is extracted 3 times.

When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting:

{code:txt}
\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
>  xmlns="http://www.w3.org/1999/xhtml">
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> 
> Dummy PDF file
> 
>   Dummy PDF file
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824
 ] 

David Pilato edited comment on TIKA-3364 at 4/23/21, 2:38 PM:
--

So I tried this:


{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;

pdfParser.setExtractAnnotationText(false);

if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}

parser = new AutoDetectParser(defaultParser, pdfParser);
{code}

And it seems to be producing the same effect. I'm probably missing something.


When I run it with this configuration, the extracted text is actually:

{code:txt}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

So the text is extracted 3 times.

When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting:

{code:txt}
\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}




was (Author: dadoonet):
So I tried this:


{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;

pdfParser.setExtractAnnotationText(false);

if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}

parser = new AutoDetectParser(defaultParser, pdfParser);
{code}

And it seems to be producing the same effect. I'm probably missing something.


When I run it with this configuration, the extracted text is actually:

{code:txt}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

So the text is extracted 3 times.

When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting:

{code:txt}
\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://dow

[jira] [Commented] (TIKA-3324) Add checkstyle checker

2021-04-23 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330823#comment-17330823
 ] 

Hudson commented on TIKA-3324:
--

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #205 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/205/])
TIKA-3324 -- update pom files to index 2 spaces (we had some differences); 
(tallison: 
[https://github.com/apache/tika/commit/f53c527552da57bf2936eff2ae6c326a59bf095d])
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java
* (edit) tika-batch/pom.xml
* (edit) 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/TokenCountPriorityQueue.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/batch/FileProfilerBuilder.java
* (edit) 
tika-example/src/main/java/org/apache/tika/example/MetadataAwareLuceneIndexer.java
* (edit) 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/util/ContentTagParser.java
* (edit) 
tika-eval/tika-eval-core/src/test/java/org/apache/tika/eval/core/tokens/TokenCounterTest.java
* (edit) 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/CJKBigramAwareLengthFilterFactory.java
* (edit) tika-parsers/tika-parsers-extended/pom.xml
* (edit) tika-server/tika-server-client/pom.xml
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-font-module/pom.xml
* (edit) tika-translate/pom.xml
* (edit) tika-parsers/pom.xml
* (edit) tika-pipes/tika-emitters/tika-emitter-s3/pom.xml
* (edit) 
tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java
* (edit) 
tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonStreamingSerializer.java
* (edit) 
tika-example/src/main/java/org/apache/tika/example/DisplayMetInstance.java
* (edit) tika-parsers/tika-parsers-advanced/tika-age-recogniser/pom.xml
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java
* (edit) tika-eval/tika-eval-app/pom.xml
* (edit) 
tika-eval/tika-eval-app/src/test/java/org/apache/tika/eval/app/AnalyzerManagerTest.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/FileProfiler.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/tools/TopCommonTokenCounter.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/XLSXNumFormatter.java
* (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml
* (edit) tika-parsers/tika-parsers-advanced/tika-dl/pom.xml
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java
* (edit) 
tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionDetector.java
* (edit) 
tika-eval/tika-eval-core/src/test/java/org/apache/tika/eval/core/langid/LangIdTest.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/tools/SlowCompositeReaderWrapper.java
* (edit) tika-batch/src/test/java/org/apache/tika/batch/fs/BatchProcessTest.java
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/pom.xml
* (edit) 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/metadata/TikaEvalMetadataFilter.java
* (edit) 
tika-langdetect/tika-langdetect-lingo24/src/main/java/org/apache/tika/langdetect/lingo24/Lingo24LangDetector.java
* (edit) 
tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/tokens/CommonTokenCountManager.java
* (edit) 
tika-batch/src/main/java/org/apache/tika/batch/fs/FSOutputStreamFactory.java
* (edit) tika-pipes/tika-fetch-iterators/tika-fetch-iterator-s3/pom.xml
* (edit) tika-langdetect/tika-langdetect-opennlp/pom.xml
* (edit) 
tika-example/src/main/java/org/apache/tika/example/LanguageDetectorExample.java
* (delete) tika-batch/src/test/resources/log4j_process.properties
* (edit) 
tika-example/src/main/java/org/apache/tika/example/DumpTikaConfigExample.java
* (edit) 
tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/io/ExtractReaderException.java
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-text-module/pom.xml
* (add) 
tika-langdetect/tika-langdetect-mitll-text/src/test/resources/log4j2.properties
* (edit) tika-pipes/tika-fetch-iterators/tika-fetch-iterator-jdbc/pom.xml
* (edit) 
tika-example/src/test/java/org/apache/tika/example/ContentHandlerExampleTest.java
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-miscoffice-module/pom.xml
* (edit) 
tika-eval/tika-eval-app/src/test/java/org/apache/tika/eval/app/db/AbstractBufferTest.java
* (edit) 
tika-eval/tik

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330810#comment-17330810
 ] 

Tim Allison commented on TIKA-3364:
---

We should probably add extra markup in the xhtml to identify the 
outlines/bookmarks ...

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330809#comment-17330809
 ] 

Tim Allison commented on TIKA-3364:
---

You can see the text under the {{Outlines}} node.

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Updated] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3364:
--
Attachment: Screenshot from 2021-04-23 10-15-22.png

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, 
> tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330805#comment-17330805
 ] 

Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:13 PM:
-

With the attached config file, I get this:

{noformat}



Dummy PDF file



{noformat}


was (Author: talli...@mitre.org):
{noformat}



Dummy PDF file



{noformat}

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: issue-1097.pdf, tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330805#comment-17330805
 ] 

Tim Allison commented on TIKA-3364:
---

{noformat}



Dummy PDF file



{noformat}

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: issue-1097.pdf, tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Updated] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3364:
--
Attachment: tika-bookmarks-config.xml

> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: issue-1097.pdf, tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799
 ] 

Tim Allison commented on TIKA-3364:
---

The PDF contains bookmark text, which is what is triggering the duplicate 
text.  You can configure Tika not to extract bookmark text by setting 
"extractBookmarksText" to false, with something like:

{noformat}





false



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: issue-1097.pdf
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799
 ] 

Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:08 PM:
-

The PDF contains bookmark text, which is what is triggering the duplicate 
text.  You can configure Tika not to extract bookmark text by setting 
"extractBookmarksText" to false, with something like:

{noformat}





false
{noformat}
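
The markup inside the block above did not survive in this archive. As a rough sketch only, a Tika 1.x configuration that turns off bookmark extraction for the PDF parser might look like the following. The element layout follows the general Tika 1.x parser-configuration format and is an assumption here; it is not the attached tika-bookmarks-config.xml, whose contents are not reproduced.

{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: keep the default parsers, but leave PDF handling to the entry below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDF parser with bookmark (outline) text extraction disabled -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractBookmarksText" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}

Assuming the file is saved as tika-bookmarks-config.xml next to the jar, tika-app can be pointed at it with its --config option, for example: java -jar tika-app-1.26.jar --config=tika-bookmarks-config.xml issue-1097.pdf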




was (Author: talli...@mitre.org):
The PDF contains bookmark text, which is what is triggering the duplicate 
text.  You can configure Tika not to extract bookmark text by setting 
"extractBookmarksText" to false, with something like:

{noformat}





false
{noformat}



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: issue-1097.pdf
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Comment Edited] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330799#comment-17330799
 ] 

Tim Allison edited comment on TIKA-3364 at 4/23/21, 2:08 PM:
-

The PDF contains bookmark text, which is what is triggering the duplicate 
text.  You can configure Tika not to extract bookmark text by setting 
"extractBookmarksText" to false, with something like:

{noformat}





false
{noformat}




was (Author: talli...@mitre.org):
The PDF contains bookmark text, which is what is triggering the duplicate 
text.  You can configure Tika not to extract bookmark text by setting 
"extractBookmarksText" to false, with something like:

{noformat}





false



> PDF Content is extracted twice
> --
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: David Pilato
>Priority: Major
> Attachments: issue-1097.pdf
>
>
> Hi
> Coming from [this issue in FSCrawler 
> project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that 
> the text from the PDF document is extracted more than once although PDFBox 
> seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
>  content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> [...]
> Dummy PDF file
> [...]
>   Dummy PDF file
> [...]
> {code}




[jira] [Created] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread David Pilato (Jira)

David Pilato created TIKA-3364:
--

 Summary: PDF Content is extracted twice
 Key: TIKA-3364
 URL: https://issues.apache.org/jira/browse/TIKA-3364
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.26
Reporter: David Pilato
 Attachments: issue-1097.pdf

Hi

Coming from [this issue in FSCrawler 
project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that the 
text from the PDF document is extracted more than once although PDFBox seems to 
extract it only once.

I attached the PDF.

When I run:

{code:sh}
wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
{code}

I'm getting:

{code:sh}
Dummy PDF file
{code}

But with Tika:

{code:sh}
wget https://downloads.apache.org/tika/tika-app-1.26.jar
java -jar tika-app-1.26.jar
{code}

I'm getting:

{code:xml}
<html xmlns="http://www.w3.org/1999/xhtml">
[...]
Dummy PDF file
[...]
Dummy PDF file
[...]
{code}




