[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM

2022-04-19 Thread August Valera (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524177#comment-17524177
 ] 

August Valera commented on TIKA-3666:
-

[~tallison] I can confirm that this works on the {{.docx}} file I ran through 
POIFSViewer. The behavior is exactly as expected. Great work!

 
{code:java}
Exception in thread "main" org.apache.tika.exception.EncryptedDocumentException: DRM encrypted document is not yet supported by Apache POI
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:277)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:175)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:178)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:1086)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:510)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:259)
{code}
Hopefully, with this result in hand, I can poke those above to provide me with 
some other Office variants to further validate this fix.

I presume, though, that since the fix relies on POI, it won't cover the PDF 
implementation of RMS.

 

> Detect and indicate file encrypted with Rights Management Service RMS/IRM
> -
>
>                 Key: TIKA-3666
>                 URL: https://issues.apache.org/jira/browse/TIKA-3666
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: August Valera
>            Priority: Major
>         Attachments: poifsviewer.txt, sam-poifsviewer.txt
>
>
> Rights Management Service (RMS), implemented in MS Office as Information 
> Rights Management (IRM), allows organizations to set file permissions that 
> are stored within the file. In most cases, the protected file gets a new 
> extension with a {{p}} prefix (for example, {{.txt}} becomes {{.ptxt}}). MS 
> Office and PDF files support RMS natively, so their contents are encrypted 
> without any extension change. 
> h4. Current behavior
> Running such a file through Tika produces the same results as running an 
> empty file through {{DefaultParser}} and {{OfficeParser}}.
> h4. Expected behavior
> Extract more metadata about the permissions needed to view the file (if 
> possible), and throw {{EncryptedDocumentException}}, as is already the case 
> for Office files encrypted in the more traditional manner.
> Reference: 
> [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524204#comment-17524204
 ] 

Tim Allison commented on TIKA-3666:
---

There’s a chance PDFs are also wrapped in an OLE2 container and protected the 
same way as doc/ppt/xls. I can’t tell from the online docs. Without a sample or 
clearer documentation, there’s not much I can do.

Do we know whether .ptxt files, for example, are wrapped in an OLE2 container?
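
One way to answer that question for a given sample, without opening it in any Office tooling, is to compare its first eight bytes against the OLE2 compound-file signature. The following is only a minimal sketch using the JDK alone: {{Ole2Sniffer}} and its method names are hypothetical and are not part of Tika or POI, and a matching signature says nothing about whether the content inside is RMS-protected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class Ole2Sniffer {
    // The 8-byte signature that opens every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private Ole2Sniffer() {}

    /** Returns true if the given leading bytes start with the OLE2 signature. */
    static boolean startsWithOle2Magic(byte[] head) {
        return head.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    /** Reads only the first eight bytes of the file and checks the signature. */
    static boolean isOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithOle2Magic(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more sample paths, e.g. a protected .ptxt or .pdf file.
        for (String arg : args) {
            String verdict = isOle2(Path.of(arg)) ? "OLE2 container" : "not OLE2";
            System.out.println(arg + ": " + verdict);
        }
    }
}
```

Running this against a protected {{.ptxt}} or PDF sample would show whether the OLE2 route is even a candidate for those types.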



[jira] [Created] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)
denisn created TIKA-3720:


             Summary: IllegalArgumentException in PDF parser
                 Key: TIKA-3720
                 URL: https://issues.apache.org/jira/browse/TIKA-3720
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.23
         Environment: Fedora 36, Java 11, Scala 2.13.4, Tika 1.28.1
            Reporter: denisn
         Attachments: test.pdf

Tika packages: 

"org.apache.tika" % "tika" % "1.28.1"
"org.apache.tika" % "tika-core" % "1.28.1"
"org.apache.tika" % "tika-parsers" % "1.28.1"
"org.apache.poi" % "poi" % "4.0.1"
"org.apache.poi" % "poi-ooxml" % "4.0.1"

It seems to work fine in 1.22, but 1.23 and all later versions throw an error. 
I've attached the PDF file that I tested with.

 

Exception text:
{code:java}
java.lang.IllegalArgumentException
    at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.setAttribute(Unknown Source)
    at org.apache.tika.utils.XMLReaderUtils.trySetXercesSecurityManager(XMLReaderUtils.java:721)
    at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilderFactory(XMLReaderUtils.java:289)
    at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilder(XMLReaderUtils.java:305)
    at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:58)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:59)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
    at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:44)
    at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.lang.Class.newInstance(Class.java:584)
    at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
    at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:55)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:85)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:100)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:112)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:116)
    at test.Main$DFP.<init>(Main.scala:55)
    at test.Main$CEParser.getParser(Main.scala:75)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:269)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at test.Main$.parseNode(Main.scala:194)
    at test.Main$$anon$1.parse(Main.scala:151)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
    at org.apache.tika.parser.pdf.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:321)
    at org.apache.tika.parser.pdf.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:182)
    at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
    at org.apache.tika.parser.pdf.ImageGraphicsEngine.run(ImageGraphicsEngine.java:128)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:159)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:127)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:985)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:98)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at test.Main$TimeoutPa

[jira] [Updated] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

denisn updated TIKA-3720:
-
Description: 
Tika packages: 
{code:java}
"org.apache.tika" % "tika" % "1.28.1"
"org.apache.tika" % "tika-core" % "1.28.1"
"org.apache.tika" % "tika-parsers" % "1.28.1"
"org.apache.poi" % "poi" % "4.0.1"
"org.apache.poi" % "poi-ooxml" % "4.0.1"{code}
It seems to work fine in 1.22, but 1.23 and all later versions throw an error. 
I've attached the PDF file that I tested with.

Exception text:
{code:java}
java.lang.IllegalArgumentException
    at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.setAttribute(Unknown Source)
    at org.apache.tika.utils.XMLReaderUtils.trySetXercesSecurityManager(XMLReaderUtils.java:721)
    at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilderFactory(XMLReaderUtils.java:289)
    at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilder(XMLReaderUtils.java:305)
    at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:58)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:59)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
    at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
    at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:44)
    at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.lang.Class.newInstance(Class.java:584)
    at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
    at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:55)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:85)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:100)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:112)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:116)
    at test.Main$DFP.<init>(Main.scala:55)
    at test.Main$CEParser.getParser(Main.scala:75)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:269)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at test.Main$.parseNode(Main.scala:194)
    at test.Main$$anon$1.parse(Main.scala:151)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
    at org.apache.tika.parser.pdf.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:321)
    at org.apache.tika.parser.pdf.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:182)
    at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
    at org.apache.tika.parser.pdf.ImageGraphicsEngine.run(ImageGraphicsEngine.java:128)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:159)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:127)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:985)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:98)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at test.Main$TimeoutParser.super$parse(Main.scala:67)
    at test.Main$TimeoutParser.$anonfun$parse$1(Main.scala:67)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at cats.effect.internals.IORunLoop$.c

[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524298#comment-17524298
 ] 

Tim Allison commented on TIKA-3720:
---

The problem isn't the file, although thank you for sharing it. Tika-app in 
both versions works without complaint on the test file.

In 1.28.1, is that a logged warning or a thrown exception? The 
IllegalArgumentException should be caught and logged; it should not be thrown. 
The NullPointerException is likely caused by the PDFParser's inability to find 
the TesseractOCRParser in the AutoDetectParser. I can't think of what changed 
between 1.22 and 1.23 that would make the TesseractOCRParser unfindable.
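
To make the "caught and logged, not thrown" distinction concrete, here is a minimal sketch of that pattern using only the JDK. It is not the code in {{XMLReaderUtils}}: the class name {{TolerantXmlConfig}}, the method {{trySetAttribute}} and the attribute URI are all made up for illustration. It relies on the documented JAXP contract that {{DocumentBuilderFactory.setAttribute}} throws {{IllegalArgumentException}} when the underlying implementation does not recognize the attribute.

```java
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

/** Hypothetical helper for illustration; not Tika's XMLReaderUtils. */
final class TolerantXmlConfig {
    private static final Logger LOG = Logger.getLogger(TolerantXmlConfig.class.getName());

    private TolerantXmlConfig() {}

    /**
     * Tries to set an optional attribute. An unsupported attribute is logged
     * as a warning and reported through the return value instead of being
     * rethrown, so the caller can keep using the factory.
     */
    static boolean trySetAttribute(DocumentBuilderFactory factory, String name, Object value) {
        try {
            factory.setAttribute(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            LOG.log(Level.WARNING, "XML parser does not support attribute " + name, e);
            return false;
        }
    }

    public static void main(String[] args) throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Made-up attribute name, chosen so that no implementation recognizes it.
        boolean applied = trySetAttribute(
            factory, "http://example.invalid/hypothetical-attribute", "value");
        System.out.println("attribute applied: " + applied);
        // The factory remains usable whether or not the attribute was accepted.
        System.out.println("builder created: " + (factory.newDocumentBuilder() != null));
    }
}
```

With a pattern like this, the stack trace shows up in the log output even though parsing continues, which is why it matters whether the trace in the report was logged or actually propagated to the caller.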

In 2.x, I think you're running into the same issue with the PDFParser's 
inability to find the TesseractOCRParser.  In 2.x, are you adding 
tika-parsers-standard-package as a dependency?

What does your TimeoutParser look like?  I worry that might be blocking Tika's 
ability to find the TesseractOCRParser.

Do you have tesseract installed?  Do you want it to run?


[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524311#comment-17524311
 ] 

Tim Allison commented on TIKA-3720:
---

In TIKA-2970 (which landed between 1.22 and 1.23), we stopped creating a new 
TesseractOCRParser in the PDFParser and instead started looking for the 
existing Tesseract parser in the ParseContext: 
https://github.com/apache/tika/commit/ba8088521df55e289b11023d887b161827d3cb90#diff-b4b910eaed1ec8638e1180e8fab4edfe84d75c76059f768fa5ad273cf2340e1eR172

In 1.x, if your custom parser doesn't extend CompositeParser or 
ParserDecorator, our code has no way of finding the TesseractOCRParser. In 
2.x, we look for a StatefulParser, or we try our best with whatever 
Parser is set in the ParseContext.
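
The 1.x limitation can be illustrated with a small, self-contained model. Everything in this sketch is a stand-in: the nested types {{Parser}}, {{Decorator}}, {{Composite}}, {{OcrParser}} and {{OpaqueWrapper}} are hypothetical and deliberately not Tika's classes, and {{findOcr}} only approximates the idea of walking known wrapper types to reach a nested parser. It is not the actual lookup code from the commit above.

```java
import java.util.List;
import java.util.Optional;

/** Stand-in model of a parser lookup; none of these types are Tika's. */
final class ParserLookupSketch {
    private ParserLookupSketch() {}

    /** Stand-in for a parser. */
    interface Parser {}

    /** Stand-in for a decorator that exposes the parser it wraps. */
    static class Decorator implements Parser {
        private final Parser wrapped;
        Decorator(Parser wrapped) { this.wrapped = wrapped; }
        Parser getWrapped() { return wrapped; }
    }

    /** Stand-in for a composite that exposes its child parsers. */
    static class Composite implements Parser {
        private final List<Parser> children;
        Composite(List<Parser> children) { this.children = List.copyOf(children); }
        List<Parser> getChildren() { return children; }
    }

    /** Stand-in for the OCR parser being searched for. */
    static final class OcrParser implements Parser {}

    /** A custom wrapper that gives the lookup no way to reach its delegate. */
    static final class OpaqueWrapper implements Parser {
        private final Parser hidden;
        OpaqueWrapper(Parser hidden) { this.hidden = hidden; }
    }

    /** Walks decorators and composites; any other wrapper type is a dead end. */
    static Optional<OcrParser> findOcr(Parser root) {
        if (root instanceof OcrParser) {
            return Optional.of((OcrParser) root);
        }
        if (root instanceof Decorator) {
            return findOcr(((Decorator) root).getWrapped());
        }
        if (root instanceof Composite) {
            for (Parser child : ((Composite) root).getChildren()) {
                Optional<OcrParser> found = findOcr(child);
                if (found.isPresent()) {
                    return found;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Parser tree = new Composite(List.of(new Parser() {}, new OcrParser()));
        System.out.println(findOcr(new Decorator(tree)).isPresent());     // prints true
        System.out.println(findOcr(new OpaqueWrapper(tree)).isPresent()); // prints false
    }
}
```

In this model, the second lookup fails for the same structural reason described above: the search only knows how to unwrap the wrapper types it recognizes, so a custom wrapper that extends neither of them hides everything beneath it.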

>     at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at test.Main$.parseNode(Main.scala:194)
>     at test.Main$$anon$1.parse(Main.scala:151)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
>     at 
> org.apache.tika.parser.pdf.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:321)
>     at 
> org.apache.tika.parser.pdf.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:182)
>     at 
> org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.pro

[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524334#comment-17524334
 ] 

Tim Allison commented on TIKA-3666:
---

I'm wondering if we should throw an exception in the OfficeParser if we can't 
figure out what kind of OLE2 file it is and there's an EncryptedPackage entry?
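
The idea could be sketched roughly like this. It is a minimal illustration under stated assumptions, not the OfficeParser implementation: {{EncryptedOle2Check}}, {{looksEncrypted}} and {{check}} are hypothetical names, the entry names are passed in as a plain array, and an unchecked {{IllegalStateException}} stands in for Tika's {{EncryptedDocumentException}}.

```java
// Hypothetical sketch; not part of Tika or POI.
final class EncryptedOle2Check {

    // Name of the OLE2 entry that holds the encrypted payload.
    static final String ENCRYPTED_PACKAGE = "EncryptedPackage";

    // True only when the file type could not be determined AND the
    // container has an EncryptedPackage entry at its root.
    static boolean looksEncrypted(boolean typeRecognized, String[] rootEntryNames) {
        if (typeRecognized) {
            return false; // a known type is handled by its regular parser path
        }
        for (String name : rootEntryNames) {
            if (ENCRYPTED_PACKAGE.equals(name)) {
                return true;
            }
        }
        return false;
    }

    // Stand-in for throwing EncryptedDocumentException in the real parser.
    static void check(boolean typeRecognized, String[] rootEntryNames) {
        if (looksEncrypted(typeRecognized, rootEntryNames)) {
            throw new IllegalStateException(
                    "Unrecognized OLE2 file with an EncryptedPackage entry");
        }
    }
}
```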

> Detect and indicate file encrypted with Rights Management Service RMS/IRM
> -
>
> Key: TIKA-3666
> URL: https://issues.apache.org/jira/browse/TIKA-3666
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: August Valera
>Priority: Major
> Attachments: poifsviewer.txt, sam-poifsviewer.txt
>
>
> Rights Management Service (RMS), implemented in MS Office as Information 
> Rights Management (IRM), allows organizations to set file permissions that 
> are stored within the file. In most cases, this will result in the file 
> getting a new extension (with a prefix p, such as {{.txt}} becoming 
> {{{}.ptxt{}}}), but in the case of MS Office and PDF files, which support 
> this natively, the implementation results in the file contents being 
> encrypted without any extension change. 
> h4. Current behavior
> Running such files through Tika produces results as if it was an empty file 
> ran through {{DefaultParser}} and {{{}OfficeParser{}}}.
> h4. Expected behavior
> Extract more metadata about necessary permissions to view (if possible), and 
> throwing {{EncryptedDocumentException}} as is the case with Office files 
> encrypted in the more traditional manner.
> Reference: 
> [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524337#comment-17524337
 ] 

denisn commented on TIKA-3720:
--

Sorry, I was confused because 2.x gave me output and 1.28.1 did not. I had no 
Tesseract installed. 

Yes, it is a standard-package dependency in 2.x.

I can't really share all my code, but here are my configs:
{code:java}
  val ocrConfig: TesseractOCRConfig = {
    val tessConf = new TesseractOCRConfig()
    tessConf.setLanguage("rus+eng")
    tessConf.setEnableImagePreprocessing(true)
    //tessConf.setEnableImageProcessing(1)
    tessConf
  }
  val pdfConfig: PDFParserConfig = {
    val pdfConf = new PDFParserConfig()
    pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION)
    pdfConf.setExtractInlineImages(true)
    pdfConf
  }
  ocrConfig.setLanguage("rus+eng")
  currentContext.set(classOf[TesseractOCRConfig], ocrConfig)
  currentContext.set(classOf[PDFParserConfig], pdfConfig)
  currentContext.set(classOf[Parser], contentParser)
  currentContext.set(classOf[ArchiveStreamFactory], asf)
  currentContext.set(classOf[Node], node) {code}
I've now installed Tesseract:
{code:java}
$ tesseract -v
tesseract 5.0.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.37 : libtiff 
4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511 {code}
Now it just gets stuck in parsing, due to 
https://issues.apache.org/jira/browse/TIKA-2359 (I guess?)


[jira] [Comment Edited] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524337#comment-17524337
 ] 

denisn edited comment on TIKA-3720 at 4/19/22 2:06 PM:
---

Sorry, I was confused because 2.x gave me output and 1.28.1 did not. I had no 
Tesseract installed. 

Yes, it is a standard-package dependency in 2.x.

I can't really share all my code, but here are my configs:
{code:java}
  val ocrConfig: TesseractOCRConfig = {
    val tessConf = new TesseractOCRConfig()
    tessConf.setLanguage("rus+eng")
    tessConf.setEnableImagePreprocessing(true)
    //tessConf.setEnableImageProcessing(1)
    tessConf
  }
  val pdfConfig: PDFParserConfig = {
    val pdfConf = new PDFParserConfig()
    pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION)
    pdfConf.setExtractInlineImages(true)
    pdfConf
  }
  ocrConfig.setLanguage("rus+eng")
  currentContext.set(classOf[TesseractOCRConfig], ocrConfig)
  currentContext.set(classOf[PDFParserConfig], pdfConfig)
  currentContext.set(classOf[Parser], contentParser)
  currentContext.set(classOf[ArchiveStreamFactory], asf)
  currentContext.set(classOf[Node], node) {code}
I've now installed Tesseract:
{code:java}
$ tesseract -v
tesseract 5.0.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.37 : libtiff 
4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511 {code}
Now it just gets stuck in parsing, due to 
https://issues.apache.org/jira/browse/TIKA-2359 (I guess?)



[jira] [Resolved] (TIKA-2359) Extreme slow parsing on the attachment attached

2022-04-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2359.
---
Resolution: Not A Problem

> Extreme slow parsing on the attachment attached
> ---
>
> Key: TIKA-2359
> URL: https://issues.apache.org/jira/browse/TIKA-2359
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Eugen Mayer
>Priority: Major
> Attachments: Sample-doc-file-2000kb.doc
>
>
> i have 93s for parsing this document using 1.14 in server or in cli mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 
> cores limited)





[jira] [Commented] (TIKA-2359) Extreme slow parsing on the attachment attached

2022-04-19 Thread Alexander Bias (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524341#comment-17524341
 ] 

Alexander Bias commented on TIKA-2359:
--

Dear sender,

I no longer work at Ulm University.
Your email will be neither read nor forwarded.

If you contacted me regarding the IT services of the Communication and 
Information Centre (kiz), especially regarding Moodle, please contact the 
kiz helpdesk at helpd...@uni-ulm.de instead.

If you contacted me in the context of the Moodle community or on another 
topic related to digital learning technologies, you can reach me at 
alexander.b...@lernlink.de from now on.




[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524343#comment-17524343
 ] 

Tim Allison commented on TIKA-3720:
---

bq.  I had no Tesseract installed

Ah, ok, that together with your config explains the problem in 2.x.  You were 
requiring that the PDFParser run OCR on every page ("OCR_AND_TEXT_EXTRACTION") 
but you hadn't installed Tesseract.  So, Tika was actually throwing a 
meaningful exception correctly.

If you don't want to run OCR on every page, and you're ok with Tika's AUTO 
mode, that might be more efficient.  The notion there is that Tika will only 
run OCR on a PDF page if very little text was extracted from it or if the page 
is likely to contain junk text (a high proportion of characters with no 
Unicode mappings). 
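
The per-page decision behind AUTO mode can be pictured with the sketch below. It is only an illustration of the notion described above, not Tika's implementation: the class name {{AutoOcrSketch}}, the method {{shouldOcr}} and both threshold values are assumptions chosen for the example.

```java
// Hypothetical sketch of an "OCR only when needed" page heuristic.
final class AutoOcrSketch {

    // Assumed thresholds for illustration; Tika's real values may differ.
    static final int MIN_CHARS_PER_PAGE = 10;
    static final double MAX_UNMAPPED_FRACTION = 0.10;

    // totalChars: characters extracted from the page's text layer.
    // unmappedChars: how many of those had no Unicode mapping.
    static boolean shouldOcr(int totalChars, int unmappedChars) {
        if (totalChars < MIN_CHARS_PER_PAGE) {
            return true; // very little text: probably a scanned page
        }
        double unmappedFraction = (double) unmappedChars / totalChars;
        return unmappedFraction > MAX_UNMAPPED_FRACTION; // likely junk text
    }
}
```

With these assumed thresholds, a page with 1000 extracted characters of which 5 are unmapped is left alone, while an empty page or a page with 500 unmapped characters out of 1000 is sent to OCR.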


[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524362#comment-17524362
 ] 

denisn commented on TIKA-3720:
--

I waited long enough for the parser in 1.22 and got a result (my 
TimeoutParser actually stops parsing after 10 minutes), but in 1.23 and 1.28.1 
I still get an error at startup:

 
{code:java}
WARNING: Tesseract OCR is installed and will be automatically applied to image 
files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on 
via TikaConfig.
апр. 19, 2022 7:21:27 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Error: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.pdf.PDFParser@5c72a85d
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.pdf.PDFParser@5c72a85d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at test.Main$TimeoutParser.super$parse(Main.scala:67)
    at test.Main$TimeoutParser.$anonfun$parse$1(Main.scala:67)
    at unsafeRunSync @ test.Main$TimeoutParser.parse(Main.scala:68)
Caused by: java.lang.NullPointerException
    at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
    at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:441)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:169)
    at 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
    at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
    at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:162)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at test.Main$TimeoutParser.super$parse(Main.scala:67)
    at test.Main$TimeoutParser.$anonfun$parse$1(Main.scala:67)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at 
cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:104)
    at 
cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:463)
    at 
cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:484)
    at 
cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:422)
    at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
    at 
java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at 
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at 
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at 
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
 {code}
 

In 2.x I got no results at all within 10 minutes, but at least the run didn't 
fail.

 


[jira] [Comment Edited] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524362#comment-17524362
 ] 

denisn edited comment on TIKA-3720 at 4/19/22 2:57 PM:
---

I waited long enough for the parser in 1.22 and got a result (my 
TimeoutParser actually stops parsing after 10 minutes), but in 1.23 and 1.28.1 
I still get an error at startup:

1.23:
{code:java}
WARNING: Tesseract OCR is installed and will be automatically applied to image 
files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on 
via TikaConfig.
апр. 19, 2022 7:21:27 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Error: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.pdf.PDFParser@5c72a85d
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.pdf.PDFParser@5c72a85d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at test.Main$TimeoutParser.super$parse(Main.scala:67)
    at test.Main$TimeoutParser.$anonfun$parse$1(Main.scala:67)
    at unsafeRunSync @ test.Main$TimeoutParser.parse(Main.scala:68)
Caused by: java.lang.NullPointerException
    at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
    at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:441)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:169)
    at 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
    at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
    at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:162)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at test.Main$TimeoutParser.super$parse(Main.scala:67)
    at test.Main$TimeoutParser.$anonfun$parse$1(Main.scala:67)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at 
cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:104)
    at 
cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:463)
    at 
cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:484)
    at 
cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:422)
    at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
    at 
java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at 
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at 
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at 
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
 {code}
1.28.1:
{code:java}
PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image 
files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on 
via TikaConfig.
апр. 19, 2022 7:56:16 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Error: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.pdf.PDFParser@34af2320
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.pdf.PDFParser@34af2320
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:297)
    at org.apache.tika.parser.ParserDecorator.parse(Pars

[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524368#comment-17524368
 ] 

denisn commented on TIKA-3720:
--

Well, it seems to be a different problem. Should I open another issue?

> IllegalArgumentException in PDF parser
> --
>
> Key: TIKA-3720
> URL: https://issues.apache.org/jira/browse/TIKA-3720
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.23
> Environment: Fedora 36, Java 11, Scala 2.13.4, Tika 1.28.1
>Reporter: denisn
>Priority: Major
> Attachments: test.pdf
>
>
> Tika packages: 
> {code:java}
> "org.apache.tika" % "tika" % "1.28.1"
> "org.apache.tika" % "tika-core" % "1.28.1"
> "org.apache.tika" % "tika-parsers" % "1.28.1"
> "org.apache.poi" % "poi" % "4.0.1"
> "org.apache.poi" % "poi-ooxml" % "4.0.1"{code}
> It seems to work fine in 1.22, but in 1.23 and all later versions there is 
> an error. I've attached the PDF file that I tested.
> Exception text:
> {code:java}
> java.lang.IllegalArgumentException
>     at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.setAttribute(Unknown 
> Source)
>     at 
> org.apache.tika.utils.XMLReaderUtils.trySetXercesSecurityManager(XMLReaderUtils.java:721)
>     at 
> org.apache.tika.utils.XMLReaderUtils.getDocumentBuilderFactory(XMLReaderUtils.java:289)
>     at 
> org.apache.tika.utils.XMLReaderUtils.getDocumentBuilder(XMLReaderUtils.java:305)
>     at 
> org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:58)
>     at 
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
>     at 
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:59)
>     at 
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
>     at 
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
>     at 
> org.apache.tika.parser.external.CompositeExternalParser.(CompositeExternalParser.java:44)
>     at 
> org.apache.tika.parser.external.CompositeExternalParser.(CompositeExternalParser.java:37)
>     at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>  Method)
>     at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at 
> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
>     at java.base/java.lang.Class.newInstance(Class.java:584)
>     at 
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
>     at 
> org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:55)
>     at org.apache.tika.parser.DefaultParser.(DefaultParser.java:85)
>     at org.apache.tika.parser.DefaultParser.(DefaultParser.java:100)
>     at org.apache.tika.parser.DefaultParser.(DefaultParser.java:112)
>     at org.apache.tika.parser.DefaultParser.(DefaultParser.java:116)
>     at test.Main$DFP.(Main.scala:55)
>     at test.Main$CEParser.getParser(Main.scala:75)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:269)
>     at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at test.Main$.parseNode(Main.scala:194)
>     at test.Main$$anon$1.parse(Main.scala:151)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
>     at 
> org.apache.tika.parser.pdf.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:321)
>     at 
> org.apache.tika.parser.pdf.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:182)
>     at 
> org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at 
> org.apache.tika.parser.pdf.ImageGraphicsEngine.run(ImageGraphicsEngine.java:128)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:159)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:139)
>     at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:127)
>   

[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524376#comment-17524376
 ] 

Tim Allison commented on TIKA-3720:
---

The problem with 1.x or 2.x? 

1.x (I think) is caused by your custom parser not extending CompositeParser or 
ParserDecorator.  We effectively fixed that in 2.x.

In 2.x, I don't think there's a problem.  You've selected 
"OCR_AND_TEXT_EXTRACTION", which means that Tika will run OCR on every page of 
your PDF.  That can take quite a bit of time.


[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524386#comment-17524386
 ] 

denisn commented on TIKA-3720:
--

I guess the problem is only in 1.x, starting with 1.23. Both 1.23 and 1.28.1
fail right after startup and don't extract anything from my PDF. My parser
extends ParserDecorator at some point:
{code:java}
private class TimeoutParser(parser: Parser) extends ParserDecorator(parser)

private class CEParser extends AutoDetectParser {

  override protected def getParser(metadata: Metadata, context: ParseContext): Parser = {
    val parser = super.getParser(metadata, context)
    val p = if (parser.isInstanceOf[DefaultParser]) new DFP().getParserPublic(metadata, context) else parser
    new TimeoutParser(parser)
  }
}

private val parser = new CEParser()

val contentParser = new AbstractParser {
  def parse = parser.parse
}

val currentContext = new ParseContext

currentContext.set(classOf[TesseractOCRConfig], ocrConfig)
currentContext.set(classOf[PDFParserConfig], pdfConfig)
currentContext.set(classOf[Parser], contentParser){code}

2.x is not failing at startup (I just didn't wait long enough to get the
results).


[jira] [Comment Edited] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524386#comment-17524386
 ] 

denisn edited comment on TIKA-3720 at 4/19/22 3:25 PM:
---

I guess the problem is only in 1.x, starting with 1.23. Both 1.23 and 1.28.1
fail right after startup and don't extract anything from my PDF. My parser
extends ParserDecorator at some point:
{code:java}
private class TimeoutParser(parser: Parser) extends ParserDecorator(parser)

private class CEParser extends AutoDetectParser {

  override protected def getParser(metadata: Metadata, context: ParseContext): Parser = {
    val parser = super.getParser(metadata, context)
    new TimeoutParser(parser)
  }
}

private val parser = new CEParser()

val contentParser = new AbstractParser {
  def parse = parser.parse
}

val currentContext = new ParseContext

currentContext.set(classOf[TesseractOCRConfig], ocrConfig)
currentContext.set(classOf[PDFParserConfig], pdfConfig)
currentContext.set(classOf[Parser], contentParser){code}

2.x is not failing at startup (I just didn't wait long enough to get the
results).


was (Author: JIRAUSER288220):
I guess the problem is only in 1.x, starting with 1.23. Both 1.23 and 1.28.1
fail right after startup and don't extract anything from my PDF. My parser
extends ParserDecorator at some point:
{code:java}
private class TimeoutParser(parser: Parser) extends ParserDecorator(parser)

private class CEParser extends AutoDetectParser {

  override protected def getParser(metadata: Metadata, context: ParseContext): Parser = {
    val parser = super.getParser(metadata, context)
    val p = if (parser.isInstanceOf[DefaultParser]) new DFP().getParserPublic(metadata, context) else parser
    new TimeoutParser(parser)
  }
}

private val parser = new CEParser()

val contentParser = new AbstractParser {
  def parse = parser.parse
}

val currentContext = new ParseContext

currentContext.set(classOf[TesseractOCRConfig], ocrConfig)
currentContext.set(classOf[PDFParserConfig], pdfConfig)
currentContext.set(classOf[Parser], contentParser){code}

2.x is not failing at startup (I just didn't wait long enough to get the
results).


[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524515#comment-17524515
 ] 

Tim Allison commented on TIKA-3720:
---

Interesting.  The issue (I think) is that you're "hiding" your CEParser in your 
"contentParser".  That is, your outermost parser is an abstract parser, not a 
ParserDecorator.  This means that when the PDFParser tries to find the 
underlying parsers, it hits your AbstractParser and stops.  Try sending your 
parser into the currentContext instead of your contentParser.
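
The lookup behaviour described above can be pictured with a small stand-alone model. This is only a sketch and not Tika code: the names Parser, Leaf, Decorator, Opaque and findLeaf are hypothetical and exist only for illustration. It assumes, as the comment suggests, that the lookup unwraps wrappers it recognises and gives up at anything else.

```java
import java.util.Optional;

public class ParserLookupSketch {

    /** Hypothetical stand-in for a parser; this is not Tika's Parser interface. */
    interface Parser {
        String name();
    }

    /** A concrete parser that the lookup is trying to reach. */
    static final class Leaf implements Parser {
        private final String name;
        Leaf(String name) { this.name = name; }
        @Override public String name() { return name; }
    }

    /** A wrapper the lookup knows how to unwrap (the decorator role). */
    static final class Decorator implements Parser {
        private final Parser wrapped;
        Decorator(Parser wrapped) { this.wrapped = wrapped; }
        Parser wrapped() { return wrapped; }
        @Override public String name() { return "decorator(" + wrapped.name() + ")"; }
    }

    /** A wrapper that delegates internally but offers no way to look inside. */
    static final class Opaque implements Parser {
        private final Parser hidden;
        Opaque(Parser hidden) { this.hidden = hidden; }
        @Override public String name() { return "opaque(" + hidden.name() + ")"; }
    }

    /** Unwraps known decorators; returns empty when it ends on anything but a Leaf. */
    static Optional<Leaf> findLeaf(Parser p) {
        while (p instanceof Decorator) {
            p = ((Decorator) p).wrapped();
        }
        return p instanceof Leaf ? Optional.of((Leaf) p) : Optional.<Leaf>empty();
    }

    public static void main(String[] args) {
        Parser chain = new Decorator(new Decorator(new Leaf("pdf")));
        // A chain made only of known decorators can be searched.
        System.out.println(findLeaf(chain).isPresent());
        // The same chain behind an opaque wrapper cannot.
        System.out.println(findLeaf(new Opaque(chain)).isPresent());
    }
}
```

In this model, Opaque plays the role of the anonymous AbstractParser (contentParser) and Decorator the role of a ParserDecorator subclass. Putting the decorator chain itself into the context, rather than the opaque wrapper, is what keeps the underlying parsers reachable.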


[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-19 Thread Daniel Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524526#comment-17524526
 ] 

Daniel Coldrick commented on TIKA-3719:
---

I managed to get it working with HTTPS by using the example under "Configuring 
[HTTP/2|https://http2.github.io/] over 
[TLS|https://en.wikipedia.org/wiki/Transport_Layer_Security]":

[https://www.javacodegeeks.com/2022/01/so-you-want-to-expose-your-jax-rs-services-over-http-2.html]

 

 
{code:java}
String url = "https://" + host + ":" + port + "/";

// Describe the keystore that holds the server certificate.
KeyStoreType keystore = new KeyStoreType();
keystore.setType("JKS");
keystore.setPassword("1");
keystore.setResource("keystore.jks");
KeyManagersType kmt = new KeyManagersType();
kmt.setKeyStore(keystore);
kmt.setKeyPassword("1");

// Build the TLS parameters from the key managers.
TLSServerParameters parameters = new TLSServerParameters();
parameters.setKeyManagers(TLSParameterJaxBUtils.getKeyManagers(kmt));

// sf is assumed to be the server factory bean from TikaServerProcess.
JettyHTTPServerEngineFactory factory = new JettyHTTPServerEngineFactory();
sf.setAddress(url);
sf.setResourceComparator(new ProduceTypeResourceComparator());
factory.setBus(sf.getBus());
BindingFactoryManager manager = sf.getBus().getExtension(BindingFactoryManager.class);
// Register the TLS parameters for this host and port.
factory.setTLSServerParametersForPort(host, port, parameters);
JAXRSServerFactoryCustomizationUtils.customize(sf); {code}
 

 

I changed the above code in TikaServerProcess and managed to spawn an HTTPS 
Jetty server. Hopefully that is of some use to you.

 

 

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Daniel Coldrick
>Priority: Minor
>
> We need the ability to run Tika server as an HTTPS endpoint, but I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (TIKA-3721) DGN parser

2022-04-19 Thread Dan Coldrick (Jira)
Dan Coldrick created TIKA-3721:
--

 Summary: DGN parser
 Key: TIKA-3721
 URL: https://issues.apache.org/jira/browse/TIKA-3721
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 2.3.0
Reporter: Dan Coldrick


Does anyone have any experience with MicroStation's DGN file format? Tika 
doesn't appear to have a parser for it, so would it be possible to create one? 

https://docs.fileformat.com/cad/dgn/





[jira] [Created] (TIKA-3722) OOM exception on xlsx parsing

2022-04-19 Thread sagi shechter (Jira)
sagi shechter created TIKA-3722:
---

 Summary: OOM exception on xlsx parsing
 Key: TIKA-3722
 URL: https://issues.apache.org/jira/browse/TIKA-3722
 Project: Tika
  Issue Type: Bug
Reporter: sagi shechter


 
The full exception stack trace is included below:
{code:java}
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
    at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
    at java.base/java.util.BitSet.expandTo(BitSet.java:353)
    at java.base/java.util.BitSet.set(BitSet.java:448)
    at 
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
    at 
org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
    at 
org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
    at 
org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
    at 
org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
 Source)
    at 
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
    at 
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
    at 
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
    at 
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
    at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:473)
    at 
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.outputCell(XSSFSheetXMLHandler.java:444)
    at 
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:317)
    at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:561)
    at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:132)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown 
Source)
    at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
 Source)
    at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
{code}
 





[jira] [Updated] (TIKA-3722) OOM exception on xlsx parsing

2022-04-19 Thread sagi shechter (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sagi shechter updated TIKA-3722:

Attachment: records_headers_02.xlsx

> OOM exception on xlsx parsing
> -
>
> Key: TIKA-3722
> URL: https://issues.apache.org/jira/browse/TIKA-3722
> Project: Tika
>  Issue Type: Bug
>Reporter: sagi shechter
>Priority: Major
> Attachments: records_headers_02.xlsx
>
>
>  
> {code:java}
> The full exception stack trace is included below:
> java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
>     at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>     at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>     at java.base/java.util.BitSet.set(BitSet.java:448)
>     at 
> de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>     at 
> org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
>     at 
> org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>     at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
>  Source)
>     at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:473)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.outputCell(XSSFSheetXMLHandler.java:444)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:317)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:561)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:132)
>     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown 
> Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>  Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> {code}
>  





[jira] [Updated] (TIKA-3722) OOM exception on xlsx parsing

2022-04-19 Thread sagi shechter (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sagi shechter updated TIKA-3722:

Attachment: (was: records_headers_02.xlsx)

> OOM exception on xlsx parsing
> -
>
> Key: TIKA-3722
> URL: https://issues.apache.org/jira/browse/TIKA-3722
> Project: Tika
>  Issue Type: Bug
>Reporter: sagi shechter
>Priority: Major
>
>  
> {code:java}
> The full exception stack trace is included below:
> java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
>     at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>     at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>     at java.base/java.util.BitSet.set(BitSet.java:448)
>     at 
> de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>     at 
> org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
>     at 
> org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>     at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
>  Source)
>     at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:473)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.outputCell(XSSFSheetXMLHandler.java:444)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:317)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:561)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:132)
>     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown 
> Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>  Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> {code}
>  





[jira] [Updated] (TIKA-3722) OOM exception on xlsx parsing

2022-04-19 Thread sagi shechter (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sagi shechter updated TIKA-3722:

Description: 
The file is ~3 MB and fails with the Tika app 2.3.0
{code:java}
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
    at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
    at java.base/java.util.BitSet.expandTo(BitSet.java:353)
    at java.base/java.util.BitSet.set(BitSet.java:448)
    at 
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
    at 
org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
    at 
org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
    at 
org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
    at 
org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
 Source)
    at 
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
    at 
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
    at 
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
    at 
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
    at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:473)
    at 
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.outputCell(XSSFSheetXMLHandler.java:444)
    at 
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:317)
    at 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:561)
    at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:132)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown 
Source)
    at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
 Source)
    at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
{code}
 

  was:
 
{code:java}
The full exception stack trace is included below:
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
    at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
    at java.base/java.util.BitSet.expandTo(BitSet.java:353)
    at java.base/java.util.BitSet.set(BitSet.java:448)
    at 
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
    at 
org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
    at 
org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at 
org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
    at 
org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
    at 
org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
 Source)
    at 
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
    at 
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
    at 
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
    at 
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
    at 
org.apach

[jira] [Updated] (TIKA-3722) OOM exception on xlsx parsing

2022-04-19 Thread sagi shechter (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sagi shechter updated TIKA-3722:

Affects Version/s: 2.3.0

> OOM exception on xlsx parsing
> -
>
> Key: TIKA-3722
> URL: https://issues.apache.org/jira/browse/TIKA-3722
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: sagi shechter
>Priority: Major
>
> The file is ~3 MB and fails with the Tika app 2.3.0
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
>     at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>     at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>     at java.base/java.util.BitSet.set(BitSet.java:448)
>     at 
> de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>     at 
> org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
>     at 
> org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>     at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
>  Source)
>     at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:473)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.outputCell(XSSFSheetXMLHandler.java:444)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:317)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:561)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:132)
>     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown 
> Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>  Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> {code}
>  





[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524570#comment-17524570
 ] 

Tim Allison commented on TIKA-3719:
---

Yes, yes, and more yes!  Thank you!  

How can we parameterize this?  Or, if I figure out the mechanism, which 
parameters do we want to make settable/configurable?

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks





[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-19 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524588#comment-17524588
 ] 

Dan Coldrick commented on TIKA-3719:


Hi [~tallison] 

I'm far from being a Java developer, so I'm not sure how much further I can 
help, but how about adding some parameters to the XML config file? Something like:
{code:java}

    
        
            
                true
                JKS
                1
                c:/temp/keystore.jks
                JKS
                1
                c:/temp/keystore.jks
            
        
    

{code}
Also, holding keystore passwords in clear text doesn't feel right to me, so we 
might have to do something about encrypting them somehow.

The next step would also be to add some authorization (Basic Auth would be a 
good start :) ) to the server, but maybe that would be a separate feature? 
Would that be worthwhile raising?
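
As a rough illustration of the keystore idea above, the sketch below shows how a 
JKS keystore and its password could be turned into an SSLContext using only the 
JDK. It is not how Tika Server does or will do this. The class name, the method 
name, and the TIKA_KEYSTORE_PASSWORD environment variable are all hypothetical, 
and reading the password from the environment is just one possible way to keep 
it out of the config file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

// Hypothetical helper for illustration only; not part of Tika.
public class KeystoreTlsSketch {

    // Builds an SSLContext from a keystore file. A null path loads an empty
    // in-memory keystore, which is only useful for trying the code out.
    public static SSLContext buildContext(Path keystorePath, String type, char[] password) {
        try {
            KeyStore keyStore = KeyStore.getInstance(type);
            if (keystorePath == null) {
                keyStore.load(null, password);
            } else {
                try (InputStream in = Files.newInputStream(keystorePath)) {
                    keyStore.load(in, password);
                }
            }
            KeyManagerFactory kmf =
                    KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
            kmf.init(keyStore, password);
            SSLContext context = SSLContext.getInstance("TLS");
            // null, null: use the JDK's default trust managers and random source.
            context.init(kmf.getKeyManagers(), null, null);
            return context;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Could not build SSLContext", e);
        }
    }

    public static void main(String[] args) {
        // Assumed variable name; falls back to a dummy value for the demo.
        String fromEnv = System.getenv("TIKA_KEYSTORE_PASSWORD");
        char[] password = (fromEnv != null ? fromEnv : "changeit").toCharArray();
        Path path = args.length > 0 ? Path.of(args[0]) : null;
        SSLContext context = buildContext(path, "JKS", password);
        System.out.println("Protocol: " + context.getProtocol());
    }
}
```

Passing a path such as c:/temp/keystore.jks as the first argument would load 
that file; how the resulting context gets wired into the server is left open 
here.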

> Tika Server Ability to Run HTTPs
> 
>
> Key: TIKA-3719
> URL: https://issues.apache.org/jira/browse/TIKA-3719
> Project: Tika
>  Issue Type: Wish
>  Components: tika-server
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> We need the ability to run TIKA server as a https end point, I can't see 
> anything in the config that allows for this. 
> Looks like I'm not the only one:
> [https://stackoverflow.com/questions/7031/apache-tika-convert-apache-tika-server-rest-endpointsjax-rs-http-to-https]
>  
> If anyone can point to some documentation on how it might be possible it 
> would be really appreciated.
>  
> Thanks





New JDK 19 EA builds and JCE Survey!

2022-04-19 Thread David Delabassee

Greetings!

The proposed schedule for JDK 19 is now known [1] with ‘Rampdown Phase 
One’ set for June 9th and ‘General Availability’ set for September 20th. 
The next several weeks will be interesting to watch as the scope of JDK 
19 is revealed.


You also play an important role during these phases, which are your 
opportunity to share feedback. When developers such as yourself tell us 
of issues faced in the latest OpenJDK early-access (EA) builds, we then 
have a chance to fix them before that feature release reaches general 
availability (GA).


We also enjoy it when people tell us that all their tests are green! It 
gives us confidence ;-) So regardless of the tests' color (red or green), 
please do not hesitate to send me a short mail, as both types of feedback 
are useful.


[1] https://mail.openjdk.java.net/pipermail/jdk-dev/2022-April/006481.html


## Heads-Up: Java Cryptographic Extension Survey

The Java Cryptographic Extension (JCE) has been in Java SE for a long 
time and has seen incremental changes over the years. The OpenJDK 
Security Team is conducting a survey [2] to learn more about how projects 
are using JCE and what changes, features, and API enhancements would be 
useful going forward.


The survey is closing on April 29, so if you have written or maintain 
code that uses the JCE API, please make sure to fill out this short survey 
[2] as soon as possible.


[2] https://www.questionpro.com/t/AUzP7ZrFWv


## Heads-Up: New Default Rendering Pipeline on macOS

JEP 382 [3] introduced support in JDK 17 for the new macOS Metal 
graphics pipeline for Swing and Java2D. Starting with build 18, JDK 19 
switches the default to the new macOS Metal rendering pipeline 
instead of the old Apple OpenGL API. For more details please see 
JDK-8284378 [4].


Java applications running on macOS (10.14 or later) will not need to 
take any action, as they will automatically benefit from faster graphics 
with lower power consumption, and the use of a more modern stable 
graphics API which will be able to work better on current and future 
Apple systems.
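
For anyone who wants to check which pipeline an application was explicitly 
asked to use, a minimal sketch follows. It only inspects system properties: 
`sun.java2d.metal` is the switch described in JEP 382, and `sun.java2d.opengl` 
is the long-standing Java 2D flag. The class and method names are made up for 
this example, and the code cannot tell which pipeline the JDK actually selected.

```java
public class RenderPipelineCheck {

    // Describes which pipeline was explicitly requested via system properties.
    // This reports the request only, not what the JDK ended up using.
    public static String requestedPipeline(String metalProp, String openglProp) {
        if ("true".equalsIgnoreCase(metalProp)) {
            return "metal";
        }
        if ("true".equalsIgnoreCase(openglProp)) {
            return "opengl";
        }
        return "default";
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name", "");
        String requested = requestedPipeline(
                System.getProperty("sun.java2d.metal"),
                System.getProperty("sun.java2d.opengl"));
        System.out.println(os + ": requested pipeline = " + requested);
    }
}
```

Running it with `-Dsun.java2d.metal=true` on the command line should report 
the Metal request; with no flags it reports the JDK default, whatever that is 
for the build in use.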


[3] https://openjdk.java.net/jeps/382
[4] https://bugs.openjdk.java.net/browse/JDK-8284378


## JDK 19 Early-Access builds

JDK 19 Early-Access builds 18 are now available [5], and are provided 
under the GNU General Public License v2, with the Classpath Exception. 
The Release Notes are available here [6].


[5] https://jdk.java.net/19/
[6] https://jdk.java.net/19/release-notes

### JEPs targeted to JDK 19, so far:
- JEP 422: Linux/RISC-V Port https://openjdk.java.net/jeps/422

### Recent changes that may be of interest:

Build 18:
- JDK-8284378: Make Metal the default Java 2D rendering pipeline for macOS
- JDK-8265315: Update CLDR to version 41
- JDK-8270090: C2: LCM may prioritize CheckCastPP nodes over projections [Reported by JaCoCo]
- JDK-8284361: Updating ASM to 9.3 for JDK 19
- JDK-8284330: jcmd may not be able to find processes in the container
- JDK-8284579: Improve VarHandle checks for interpreter

Build 17:
- JDK-8282819: Deprecate Locale class constructors
- JDK-8254935: Deprecate the PSSParameterSpec(int) constructor
- JDK-8283060: RawNativeLibraries should allow multiple clients to load/unload the same library


Build 16:
- JDK-8281561: Disable http DIGEST mechanism with MD5 and SHA-1 by default
- JDK-8264160: Regex \b is not consistent with \w without UNICODE_CHARACTER_CLASS
- JDK-8163327: Remove 3DES from the default enabled cipher suites list
- JDK-8267319: Use larger default key sizes and algorithms based on CNSA
- JDK-8283350: (tz) Update Timezone Data to 2022a
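
For code affected by the Locale constructor deprecation listed under Build 17 
(JDK-8282819), one possible migration is sketched below. It uses 
`Locale.forLanguageTag` and `Locale.Builder`, both of which exist in earlier 
JDK releases as well; the class and method names are invented for this example.

```java
import java.util.Locale;

public class LocaleWithoutConstructors {

    // Builds a Locale from a BCP 47 language tag instead of new Locale(...).
    public static Locale fromTag(String languageTag) {
        return Locale.forLanguageTag(languageTag);
    }

    // Builds a Locale from separate language and region codes.
    public static Locale fromParts(String language, String region) {
        return new Locale.Builder().setLanguage(language).setRegion(region).build();
    }

    public static void main(String[] args) {
        System.out.println(fromTag("en-US").getCountry());          // prints "US"
        System.out.println(fromParts("de", "CH").toLanguageTag());  // prints "de-CH"
    }
}
```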


## Project Loom Updates

The first Loom-related JEP is now in Candidate phase, i.e. JEP 425: 
Virtual Threads (Preview) [7]. As of now, JEP 425 is not yet 'proposed 
to target' any particular feature release.


[7] https://openjdk.java.net/jeps/425

In addition, Project Loom early-access builds 19-loom+5-429 (2022/4/4) 
are now available [8] with related Javadoc [9].


These builds are based on JDK 19 and are provided under the GNU General 
Public License, version 2, with the Classpath Exception and are produced 
for the purpose of gathering feedback. Use for any other purpose is at 
your own risk. Proper feedback should be sent to the `loom-dev` mailing 
list [10].


[8] https://jdk.java.net/loom/
[9] https://download.java.net/java/early_access/loom/docs/api/
[10] https://mail.openjdk.java.net/mailman/listinfo/loom-dev


## Topics of Interest:

* New candidate JEP: 426: Vector API (4th Incubator)
https://openjdk.java.net/jeps/426

* Virtual Thread Deep Dive - Inside Java Newscast #23
https://inside.java/2022/04/07/insidejava-newscast-023/

* Project Panama: Say Goodbye to JNI
https://inside.java/2022/04/04/projectpanama/

* Java Cryptographic Extension Survey
https://inside.java/2022/04/12/jce-survey/

As usual, let us know if you find any issues while testing your 
project(s) on the latest JDK early-access builds. Thanks for your support!


--David



[jira] [Commented] (TIKA-3721) DGN parser

2022-04-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524718#comment-17524718
 ] 

Nick Burch commented on TIKA-3721:
--

After a quick look, I can't spot any free tools or libraries for working with 
these files. OpenDGN appears not to use our normal sense of open, and seems to 
want an expensive SDK license.

I did find a nice document on the DWG file format on the new OpenDGN site - 
[https://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf]
 - but nothing for the DGN format there that I can find.

If you're able to locate a tool or library, we can look at adding support. 
Alternately if your company has licensed the SDK, it's fairly easy for you to 
build your own custom Tika parser to wrap it, see 
https://tika.apache.org/2.3.0/parser_guide.html

> DGN parser
> --
>
> Key: TIKA-3721
> URL: https://issues.apache.org/jira/browse/TIKA-3721
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Dan Coldrick
>Priority: Minor
>
> Does anyone have any experience with the DGN file format by MicroStation? I 
> see TIKA doesn't have a parser so would it be possible to create one? 
> https://docs.fileformat.com/cad/dgn/





[jira] [Commented] (TIKA-3722) OOM exception on xlsx parsing

2022-04-19 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524736#comment-17524736
 ] 

Tilman Hausherr commented on TIKA-3722:
---

Please attach the file and try with different -Xmx values.
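
To confirm which heap limit a given -Xmx setting actually produces, a small 
stand-alone check like the one below can help. It is only a sketch using 
`java.lang.Runtime`; the class name is made up, and it says nothing about how 
much memory this particular xlsx file needs.

```java
public class HeapReport {

    // Converts a byte count to whole mebibytes (rounding down).
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit, or the JVM default if unset.
        System.out.println("max heap:  " + toMiB(rt.maxMemory()) + " MiB");
        System.out.println("allocated: " + toMiB(rt.totalMemory()) + " MiB");
        System.out.println("free:      " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Running it once as `java -Xmx512m HeapReport` and once as `java -Xmx2g 
HeapReport` should show the max heap figure changing accordingly; the same 
flag goes before `-jar` when launching the Tika app.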

> OOM exception on xlsx parsing
> -
>
> Key: TIKA-3722
> URL: https://issues.apache.org/jira/browse/TIKA-3722
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: sagi shechter
>Priority: Major
>
> The file is ~3 MB and fails with the Tika app 2.3.0
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Arrays.copyOf(Arrays.java:3817)
>     at java.base/java.util.BitSet.ensureCapacity(BitSet.java:338)
>     at java.base/java.util.BitSet.expandTo(BitSet.java:353)
>     at java.base/java.util.BitSet.set(BitSet.java:448)
>     at 
> de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>     at 
> org.apache.tika.sax.boilerpipe.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:165)
>     at 
> org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:97)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at 
> org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>     at 
> org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at 
> org.apache.tika.sax.SafeContentHandler$$Lambda$515/0x000800506c40.write(Unknown
>  Source)
>     at 
> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at 
> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at 
> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:473)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.outputCell(XSSFSheetXMLHandler.java:444)
>     at 
> org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:317)
>     at 
> org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:561)
>     at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:132)
>     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown 
> Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>  Source)
>     at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> {code}
>  





[jira] [Commented] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524767#comment-17524767
 ] 

denisn commented on TIKA-3720:
--

I've changed the contentParser to:
{code:java}
def contentParser(parser: CEParser): ParserDecorator = new ParserDecorator(parser) {
  def parse = parser.parse
}
{code}
And it seems like the problem is gone. Thank you!

> IllegalArgumentException in PDF parser
> --
>
> Key: TIKA-3720
> URL: https://issues.apache.org/jira/browse/TIKA-3720
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.23
> Environment: Fedora 36, Java 11, Scala 2.13.4, Tika 1.28.1
>Reporter: denisn
>Priority: Major
> Attachments: test.pdf
>
>
> Tika packages: 
> {code:java}
> "org.apache.tika" % "tika" %  1.28.1
> "org.apache.tika" % "tika-core" %  1.28.1
> "org.apache.tika" % "tika-parsers" %  1.28.1
> "org.apache.poi" % "poi" % "4.0.1"
> "org.apache.poi" % "poi-ooxml" % "4.0.1"{code}
> It seems to work fine in 1.22, but in 1.23 and all later versions there is 
> an error. I've attached the PDF file that I tested.
> Exception text:
> {code:java}
> java.lang.IllegalArgumentException
>     at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.setAttribute(Unknown Source)
>     at org.apache.tika.utils.XMLReaderUtils.trySetXercesSecurityManager(XMLReaderUtils.java:721)
>     at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilderFactory(XMLReaderUtils.java:289)
>     at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilder(XMLReaderUtils.java:305)
>     at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:58)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:59)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
>     at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:44)
>     at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37)
>     at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
>     at java.base/java.lang.Class.newInstance(Class.java:584)
>     at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
>     at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:55)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:85)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:100)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:112)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:116)
>     at test.Main$DFP.<init>(Main.scala:55)
>     at test.Main$CEParser.getParser(Main.scala:75)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:269)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at test.Main$.parseNode(Main.scala:194)
>     at test.Main$$anon$1.parse(Main.scala:151)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
>     at org.apache.tika.parser.pdf.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:321)
>     at org.apache.tika.parser.pdf.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:182)
>     at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at org.apache.tika.parser.pdf.ImageGraphicsEngine.run(ImageGraphicsEngine.java:128)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:159)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:139)
>     at 
> org.apac

[jira] [Comment Edited] (TIKA-3720) IllegalArgumentException in PDF parser

2022-04-19 Thread denisn (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524767#comment-17524767
 ] 

denisn edited comment on TIKA-3720 at 4/20/22 6:45 AM:
---

I've changed the contentParser to:
{code:java}
def contentParser(parser: CEParser): ParserDecorator =
  new ParserDecorator(parser) {
    def parse = parser.parse
  }
{code}
And it seems like the problem is gone. Thank you!

Also, I can now run the app without Tesseract installed. I still get errors 
about it in the logs, but parsing works. So it wasn't Tesseract's fault after 
all.


was (Author: JIRAUSER288220):
I've changed the contentParser to:
{code:java}
def contentParser(parser: CEParser): ParserDecorator =
  new ParserDecorator(parser) {
    def parse = parser.parse
  }
{code}
And it seems like the problem is gone. Thank you!

> IllegalArgumentException in PDF parser
> --
>
> Key: TIKA-3720
> URL: https://issues.apache.org/jira/browse/TIKA-3720
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.23
> Environment: Fedora 36, Java 11, Scala 2.13.4, Tika 1.28.1
>Reporter: denisn
>Priority: Major
> Attachments: test.pdf
>
>
> Tika packages: 
> {code:java}
> "org.apache.tika" % "tika" % "1.28.1"
> "org.apache.tika" % "tika-core" % "1.28.1"
> "org.apache.tika" % "tika-parsers" % "1.28.1"
> "org.apache.poi" % "poi" % "4.0.1"
> "org.apache.poi" % "poi-ooxml" % "4.0.1"{code}
> It seems to work fine in 1.22, but 1.23 and all later versions throw an 
> error. I've attached the PDF file I tested with.
> Exception text:
> {code:java}
> java.lang.IllegalArgumentException
>     at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.setAttribute(Unknown Source)
>     at org.apache.tika.utils.XMLReaderUtils.trySetXercesSecurityManager(XMLReaderUtils.java:721)
>     at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilderFactory(XMLReaderUtils.java:289)
>     at org.apache.tika.utils.XMLReaderUtils.getDocumentBuilder(XMLReaderUtils.java:305)
>     at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:58)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:59)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
>     at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
>     at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:44)
>     at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37)
>     at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
>     at java.base/java.lang.Class.newInstance(Class.java:584)
>     at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
>     at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:55)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:85)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:100)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:112)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:116)
>     at test.Main$DFP.<init>(Main.scala:55)
>     at test.Main$CEParser.getParser(Main.scala:75)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:269)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at test.Main$.parseNode(Main.scala:194)
>     at test.Main$$anon$1.parse(Main.scala:151)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
>     at org.apache.tika.parser.pdf.ImageGraphicsEngine.processImage(ImageGraphicsEngine.java:321)
>     at org.apache.tika.parser.pdf.ImageGraphicsEngine.drawImage(ImageGraphicsEngine.java:182)
>     at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:51