[jira] [Commented] (TIKA-2118) Misleading exception on a password protected XLS
[ https://issues.apache.org/jira/browse/TIKA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577172#comment-15577172 ]

Seva Alekseyev commented on TIKA-2118:
--------------------------------------

The codepage number in the exception is bogus. In my file library, I saw similar exceptions with codepage numbers all over the place. Some part of the file is misparsed and comes out as a codepage number, but it is not one.

> Misleading exception on a password protected XLS
> ------------------------------------------------
>
> Key: TIKA-2118
> URL: https://issues.apache.org/jira/browse/TIKA-2118
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
>
> When parsing the following password protected Excel file:
> https://dl.dropboxusercontent.com/u/92341073/Copy%20of%20I-LHD%203E.xls
> Tika emits an IllegalArgumentException with a message "Unsupported codepage
> requested". The inability to parse has nothing to do with codepage, that
> error is misleading.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2119) ArrayIndexOutOfBoundsException on a Word document
[ https://issues.apache.org/jira/browse/TIKA-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577167#comment-15577167 ]

Seva Alekseyev commented on TIKA-2119:
--------------------------------------

Reopened, linked.

> ArrayIndexOutOfBoundsException on a Word document
> -------------------------------------------------
>
> Key: TIKA-2119
> URL: https://issues.apache.org/jira/browse/TIKA-2119
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
>
> On the following valid Word document:
> https://dl.dropboxusercontent.com/u/92341073/Message%20to%20Eric%20Spooner.doc
> the Tika parser throws an ArrayIndexOutOfBoundsException.
[jira] [Commented] (TIKA-2120) NegativeArraySizeException on a password protected Excel workbook
[ https://issues.apache.org/jira/browse/TIKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577151#comment-15577151 ]

Seva Alekseyev commented on TIKA-2120:
--------------------------------------

Let me recheck... I meant to file a bug about the codepage exception too. Maybe I pasted the wrong thing.

> NegativeArraySizeException on a password protected Excel workbook
> -----------------------------------------------------------------
>
> Key: TIKA-2120
> URL: https://issues.apache.org/jira/browse/TIKA-2120
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Seva Alekseyev
>
> On the following password protected Excel file
> https://dl.dropboxusercontent.com/u/92341073/20090906%20real%20inventory.xls
> The Tika parser throws NegativeArraySizeException.
[jira] [Commented] (TIKA-2119) ArrayIndexOutOfBoundsException on a Word document
[ https://issues.apache.org/jira/browse/TIKA-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576706#comment-15576706 ]

Tim Allison commented on TIKA-2119:
-----------------------------------

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.poi.hwpf.sprm.SprmBuffer.append(SprmBuffer.java:128)
	at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:269)
	at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:101)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:132)
	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:642)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:153)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:169)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:130)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 43 more
{noformat}

This looks like [POI-55604|https://bz.apache.org/bugzilla/show_bug.cgi?id=55604]. Please reopen that issue. Thank you!
[jira] [Commented] (TIKA-2117) NullPointerException on PDF
[ https://issues.apache.org/jira/browse/TIKA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576683#comment-15576683 ]

Tim Allison commented on TIKA-2117:
-----------------------------------

I confirmed both this and the other issue (TIKA-2121) still exist for Tika trunk. Please confirm that they both exist with PDFBox trunk. If they do, please open issues on PDFBox's JIRA and link to this issue and TIKA-2121.

> NullPointerException on PDF
> ---------------------------
>
> Key: TIKA-2117
> URL: https://issues.apache.org/jira/browse/TIKA-2117
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
>
> Tika PDF parser emits a NullPointerException on the following PDF file:
> https://dl.dropboxusercontent.com/u/92341073/TEST_THOR.PDF
> The file displays as expected in Acrobat.
[jira] [Resolved] (TIKA-2120) NegativeArraySizeException on a password protected Excel workbook
[ https://issues.apache.org/jira/browse/TIKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2120.
-------------------------------

Resolution: Duplicate
[jira] [Commented] (TIKA-2118) Misleading exception on a password protected XLS
[ https://issues.apache.org/jira/browse/TIKA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1557#comment-1557 ]

Tim Allison commented on TIKA-2118:
-----------------------------------

You may want to check with the POI users list. Would the desired outcome be an EncryptedFileException or similar? If the file weren't encrypted, would the current behavior be ok? The parser basically doesn't know what to do with cp3197...and I think that's reasonable.
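To make the question above concrete: one possible way to report encryption instead of a codepage error is to look for a FILEPASS record before interpreting any later record. The following is only a sketch under stated assumptions, not POI or Tika code. It assumes the workbook stream has already been extracted as a byte array of BIFF records (2-byte little-endian record id, 2-byte little-endian length, then the body), and it assumes record id 0x002F marks FILEPASS. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BiffPrecheck {
    // Assumed record id for FILEPASS in the Excel binary format.
    static final int SID_FILEPASS = 0x002F;

    /** Returns true if a FILEPASS record header is found while walking the records. */
    static boolean hasFilePass(byte[] workbookStream) {
        ByteBuffer buf = ByteBuffer.wrap(workbookStream).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int sid = buf.getShort() & 0xFFFF; // record id, unsigned
            int len = buf.getShort() & 0xFFFF; // body length, unsigned
            if (sid == SID_FILEPASS) {
                return true;
            }
            if (len > buf.remaining()) {
                return false; // truncated record, stop scanning
            }
            buf.position(buf.position() + len); // skip the body without decoding it
        }
        return false;
    }

    /** Hypothetical helper: prefer an encryption message over the original parse error. */
    static String describeFailure(byte[] workbookStream, RuntimeException cause) {
        return hasFilePass(workbookStream)
                ? "Workbook appears to be password protected"
                : String.valueOf(cause.getMessage());
    }
}
```

With a check like this, a caller could map the "Unsupported codepage requested" failure to an encryption-specific message whenever `hasFilePass` returns true, and keep the existing message otherwise.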
[jira] [Commented] (TIKA-2117) NullPointerException on PDF
[ https://issues.apache.org/jira/browse/TIKA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576634#comment-15576634 ]

Tim Allison commented on TIKA-2117:
-----------------------------------

Thank you for opening this issue and the others and for sharing the triggering docs! For PDFs, would you be willing to try the steps described here: [PDF_Text_Problems|https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems]? Thank you.
[jira] [Created] (TIKA-2121) ClassCastException on a valid PDF
Seva Alekseyev created TIKA-2121:
---------------------------------

Summary: ClassCastException on a valid PDF
Key: TIKA-2121
URL: https://issues.apache.org/jira/browse/TIKA-2121
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

When parsing the following valid PDF file:
https://dl.dropboxusercontent.com/u/92341073/Protheroe%20Clin%20Gastr%202009.pdf
the Tika parser throws a ClassCastException with the message "org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary".
[jira] [Created] (TIKA-2119) ArrayIndexOutOfBoundsException on a Word document
Seva Alekseyev created TIKA-2119:
---------------------------------

Summary: ArrayIndexOutOfBoundsException on a Word document
Key: TIKA-2119
URL: https://issues.apache.org/jira/browse/TIKA-2119
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

On the following valid Word document:
https://dl.dropboxusercontent.com/u/92341073/Message%20to%20Eric%20Spooner.doc
the Tika parser throws an ArrayIndexOutOfBoundsException.
[jira] [Created] (TIKA-2118) Misleading exception on a password protected XLS
Seva Alekseyev created TIKA-2118:
---------------------------------

Summary: Misleading exception on a password protected XLS
Key: TIKA-2118
URL: https://issues.apache.org/jira/browse/TIKA-2118
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

When parsing the following password protected Excel file:
https://dl.dropboxusercontent.com/u/92341073/Copy%20of%20I-LHD%203E.xls
Tika emits an IllegalArgumentException with a message "Unsupported codepage requested". The inability to parse has nothing to do with codepage, that error is misleading.
[jira] [Created] (TIKA-2117) NullPointerException on PDF
Seva Alekseyev created TIKA-2117:
---------------------------------

Summary: NullPointerException on PDF
Key: TIKA-2117
URL: https://issues.apache.org/jira/browse/TIKA-2117
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

Tika PDF parser emits a NullPointerException on the following PDF file:
https://dl.dropboxusercontent.com/u/92341073/TEST_THOR.PDF
The file displays as expected in Acrobat.
[jira] [Updated] (TIKA-2115) OOM caused by corrupt embedded OLE object
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2115:
------------------------------

Summary: OOM caused by corrupt embedded OLE object (was: OoM Crash)

> OOM caused by corrupt embedded OLE object
> -----------------------------------------
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Generic test with tika-app-1.13.jar on test document
> Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
> There is a size field when parsing an embedded OLE object in a Powerpoint
> presentation that says there are 2GB of data that needs to be read and the
> code simply tries to allocate a buffer for that, which results in OOM.
[jira] [Commented] (TIKA-2115) OoM Crash
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575600#comment-15575600 ]

Tim Allison commented on TIKA-2115:
-----------------------------------

This likely won't make it into Tika 1.14, but thank you for opening the issue and sharing a test file!
[jira] [Created] (TIKA-2116) Upgrade to POI 3.16-beta1 when available
Tim Allison created TIKA-2116:
------------------------------

Summary: Upgrade to POI 3.16-beta1 when available
Key: TIKA-2116
URL: https://issues.apache.org/jira/browse/TIKA-2116
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
[jira] [Commented] (TIKA-2115) OoM Crash
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575304#comment-15575304 ]

Thomas Galla commented on TIKA-2115:
------------------------------------

Thank you Tim.
[jira] [Comment Edited] (TIKA-2115) OoM Crash
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575297#comment-15575297 ]

Tim Allison edited comment on TIKA-2115 at 10/14/16 1:15 PM:
-------------------------------------------------------------

Opened [Bug 60256|https://bz.apache.org/bugzilla/show_bug.cgi?id=60256] to track this.

was (Author: talli...@mitre.org):
Opened [Bug 60526|https://bz.apache.org/bugzilla/show_bug.cgi?id=60256] to track this.
[jira] [Commented] (TIKA-2115) OoM Crash
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575297#comment-15575297 ]

Tim Allison commented on TIKA-2115:
-----------------------------------

Opened [Bug 60526|https://bz.apache.org/bugzilla/show_bug.cgi?id=60256] to track this.
[jira] [Commented] (TIKA-2115) OoM Crash
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575276#comment-15575276 ]

Tim Allison commented on TIKA-2115:
-----------------------------------

Thank you for opening this. I think we'll have to fix this at the POI level, because at the Tika level, I'm getting {{nativeEntry}}'s size as 4100 and {{part}}'s size as 7168. Something appears to be going wrong in the calculation of {{dataSize}} in Ole10Native's initialization.
[jira] [Updated] (TIKA-2115) OoM Crash
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Galla updated TIKA-2115:
------------------------------

Attachment: TikaTestcase.pptx

This is the testcase document, basically a stripped version of a customer document leading to the mentioned problem.
[jira] [Created] (TIKA-2115) OoM Crash
Thomas Galla created TIKA-2115:
-------------------------------

Summary: OoM Crash
Key: TIKA-2115
URL: https://issues.apache.org/jira/browse/TIKA-2115
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Generic test with tika-app-1.13.jar on test document
Reporter: Thomas Galla

There is a size field when parsing an embedded OLE object in a Powerpoint presentation that says there are 2GB of data that needs to be read and the code simply tries to allocate a buffer for that, which results in OOM.
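As a rough illustration of the failure mode described in this report, and not the actual POI or Tika code, the Java sketch below reads a length-prefixed block defensively instead of trusting the declared size. The class name `BoundedRead`, the method `readLengthPrefixed`, the 16 MB cap, and the little-endian 32-bit size field are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BoundedRead {
    // Hypothetical upper bound on a single embedded block; tune per application.
    static final int MAX_BLOCK_BYTES = 16 * 1024 * 1024;

    /**
     * Reads a 4-byte little-endian size field followed by that many bytes.
     * The declared size is rejected if it is negative, above the cap, or
     * larger than what the enclosing entry can still hold.
     *
     * @param entrySize total size in bytes of the enclosing entry, including the size field
     */
    static byte[] readLengthPrefixed(InputStream in, long entrySize) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // readInt() is big-endian, so swap the bytes to get the little-endian value.
        int declared = Integer.reverseBytes(data.readInt());
        long available = entrySize - 4;
        if (declared < 0 || declared > MAX_BLOCK_BYTES || declared > available) {
            throw new IOException("Implausible declared size " + declared
                    + " (entry has " + available + " bytes after the size field)");
        }
        byte[] buf = new byte[declared]; // allocation happens only after validation
        data.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] good = {3, 0, 0, 0, 10, 20, 30};
        System.out.println(readLengthPrefixed(new ByteArrayInputStream(good), good.length).length);

        // Size field claims roughly 2GB while the entry holds a single byte.
        byte[] corrupt = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x7F, 1};
        try {
            readLengthPrefixed(new ByteArrayInputStream(corrupt), corrupt.length);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The design choice here is to compare the declared size against the known size of the enclosing entry before allocating, so a corrupt size field surfaces as an ordinary `IOException` and not as an `OutOfMemoryError`.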