[jira] [Commented] (TIKA-2118) Misleading exception on a password protected XLS

2016-10-14 Thread Seva Alekseyev (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577172#comment-15577172
 ] 

Seva Alekseyev commented on TIKA-2118:
--

The codepage number in the exception is bogus. In my file library, I saw 
similar exceptions with codepage values all over the place. Some part of the 
file is misparsed and comes out as a codepage number, but it isn't one.
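
To illustrate the point, here is a minimal sketch of how arbitrary bytes can surface as a "codepage". It assumes the codepage is stored as an unsigned 16-bit little-endian field; the class and method names are hypothetical and this is not POI's or Tika's code. Any two bytes of ciphertext decode to some number, for example 3197.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of POI or Tika.
final class CodepageFieldDemo {

    /** Reads an unsigned 16-bit little-endian value (the field layout assumed here). */
    static int readCodepage(byte[] recordBody) {
        if (recordBody.length < 2) {
            throw new IllegalArgumentException("need at least 2 bytes");
        }
        // Mask to treat the signed short as an unsigned value.
        return ByteBuffer.wrap(recordBody).order(ByteOrder.LITTLE_ENDIAN).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] plain = {(byte) 0xE4, 0x04}; // 0x04E4 = 1252, a real Windows codepage
        byte[] scrambled = {0x7D, 0x0C};    // arbitrary bytes, 0x0C7D = 3197
        System.out.println(readCodepage(plain));
        System.out.println(readCodepage(scrambled));
    }
}
```

The reader has no way to tell the second value from a genuine one, so the error message ends up blaming the codepage.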

> Misleading exception on a password protected XLS
> 
>
> Key: TIKA-2118
> URL: https://issues.apache.org/jira/browse/TIKA-2118
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> When parsing the following password protected Excel file:
> https://dl.dropboxusercontent.com/u/92341073/Copy%20of%20I-LHD%203E.xls
> Tika emits an IllegalArgumentException with a message "Unsupported codepage 
> requested". The inability to parse has nothing to do with codepage, that 
> error is misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2119) ArrayIndexOutOfBoundsException on a Word document

2016-10-14 Thread Seva Alekseyev (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577167#comment-15577167
 ] 

Seva Alekseyev commented on TIKA-2119:
--

Reopened, linked.

> ArrayIndexOutOfBoundsException on a Word document
> -
>
> Key: TIKA-2119
> URL: https://issues.apache.org/jira/browse/TIKA-2119
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the following valid Word document:
> https://dl.dropboxusercontent.com/u/92341073/Message%20to%20Eric%20Spooner.doc
> the Tika parser throws an ArrayIndexOutOfBoundsException.





[jira] [Commented] (TIKA-2120) NegativeArraySizeException on a password protected Excel workbook

2016-10-14 Thread Seva Alekseyev (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577151#comment-15577151
 ] 

Seva Alekseyev commented on TIKA-2120:
--

Let me recheck... I meant to file a bug about the codepage exception too. Maybe 
I pasted the wrong one.

> NegativeArraySizeException on a password protected Excel workbook
> -
>
> Key: TIKA-2120
> URL: https://issues.apache.org/jira/browse/TIKA-2120
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Seva Alekseyev
>
> On the following password protected Excel file
> https://dl.dropboxusercontent.com/u/92341073/20090906%20real%20inventory.xls
> The Tika parser throws NegativeArraySizeException.





[jira] [Commented] (TIKA-2119) ArrayIndexOutOfBoundsException on a Word document

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576706#comment-15576706
 ] 

Tim Allison commented on TIKA-2119:
---

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.poi.hwpf.sprm.SprmBuffer.append(SprmBuffer.java:128)
at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:269)
at org.apache.poi.hwpf.model.PAPBinTable.rebuild(PAPBinTable.java:101)
at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:132)
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:642)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:153)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:130)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 43 more
{noformat}

This looks like 
[POI-55604|https://bz.apache.org/bugzilla/show_bug.cgi?id=55604].  Please 
reopen that issue.

Thank you!

> ArrayIndexOutOfBoundsException on a Word document
> -
>
> Key: TIKA-2119
> URL: https://issues.apache.org/jira/browse/TIKA-2119
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the following valid Word document:
> https://dl.dropboxusercontent.com/u/92341073/Message%20to%20Eric%20Spooner.doc
> the Tika parser throws an ArrayIndexOutOfBoundsException.





[jira] [Commented] (TIKA-2117) NullPointerException on PDF

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576683#comment-15576683
 ] 

Tim Allison commented on TIKA-2117:
---

I confirmed both this and the other issue (TIKA-2121) still exist for Tika 
trunk.  Please confirm that they both exist with PDFBox trunk.  If they do, 
please open issues on PDFBox's JIRA and link to this issue and TIKA-2121.

> NullPointerException on PDF
> ---
>
> Key: TIKA-2117
> URL: https://issues.apache.org/jira/browse/TIKA-2117
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> Tika PDF parser emits a NullPointerException on the following PDF file:
> https://dl.dropboxusercontent.com/u/92341073/TEST_THOR.PDF
> The file displays as expected in Acrobat.





[jira] [Resolved] (TIKA-2120) NegativeArraySizeException on a password protected Excel workbook

2016-10-14 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2120.
---
Resolution: Duplicate

> NegativeArraySizeException on a password protected Excel workbook
> -
>
> Key: TIKA-2120
> URL: https://issues.apache.org/jira/browse/TIKA-2120
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Seva Alekseyev
>
> On the following password protected Excel file
> https://dl.dropboxusercontent.com/u/92341073/20090906%20real%20inventory.xls
> The Tika parser throws NegativeArraySizeException.





[jira] [Commented] (TIKA-2118) Misleading exception on a password protected XLS

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1557#comment-1557
 ] 

Tim Allison commented on TIKA-2118:
---

You may want to check with the POI users list.  Would the desired outcome be an 
EncryptedFileException or similar?

If the file weren't encrypted, would the current behavior be ok?  The parser 
basically doesn't know what to do with cp3197...and I think that's reasonable.
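
As a rough sketch of the two outcomes being discussed: the lookup below fails with a distinct exception when the input is flagged as encrypted, and keeps the current "unsupported codepage" behavior otherwise. Everything here is an assumption for illustration: {{EncryptedInputException}}, {{charsetNameFor}} and the {{encrypted}} flag are hypothetical names, the codepage table is deliberately incomplete, and none of this is POI's actual API.

```java
// Hypothetical sketch, not POI or Tika code.
final class CodepageLookupSketch {

    /** Hypothetical stand-in for an EncryptedFileException-like signal. */
    static final class EncryptedInputException extends RuntimeException {
        EncryptedInputException(String message) {
            super(message);
        }
    }

    /**
     * Maps a Windows codepage identifier to a charset name.
     * If the input is known to be encrypted, the field value is not
     * meaningful, so we fail before interpreting it.
     */
    static String charsetNameFor(int codepage, boolean encrypted) {
        if (encrypted) {
            throw new EncryptedInputException(
                    "Input is encrypted; codepage value " + codepage + " is not meaningful");
        }
        switch (codepage) {
            case 1200:  return "UTF-16LE";
            case 1251:  return "windows-1251";
            case 1252:  return "windows-1252";
            case 65001: return "UTF-8";
            default:
                throw new IllegalArgumentException("Unsupported codepage requested: " + codepage);
        }
    }
}
```

With this shape, {{charsetNameFor(3197, false)}} still throws IllegalArgumentException, while {{charsetNameFor(3197, true)}} reports the encryption instead of the codepage.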

> Misleading exception on a password protected XLS
> 
>
> Key: TIKA-2118
> URL: https://issues.apache.org/jira/browse/TIKA-2118
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> When parsing the following password protected Excel file:
> https://dl.dropboxusercontent.com/u/92341073/Copy%20of%20I-LHD%203E.xls
> Tika emits an IllegalArgumentException with a message "Unsupported codepage 
> requested". The inability to parse has nothing to do with codepage, that 
> error is misleading.





[jira] [Commented] (TIKA-2117) NullPointerException on PDF

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576634#comment-15576634
 ] 

Tim Allison commented on TIKA-2117:
---

Thank you for opening this issue and the others and for sharing the triggering 
docs!  

For PDFs, would you be willing to try the steps described here: 
[PDF_Text_Problems|https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems]?

Thank you. 

> NullPointerException on PDF
> ---
>
> Key: TIKA-2117
> URL: https://issues.apache.org/jira/browse/TIKA-2117
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> Tika PDF parser emits a NullPointerException on the following PDF file:
> https://dl.dropboxusercontent.com/u/92341073/TEST_THOR.PDF
> The file displays as expected in Acrobat.





[jira] [Created] (TIKA-2121) ClassCastException on a valid PDF

2016-10-14 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2121:


 Summary: ClassCastException on a valid PDF
 Key: TIKA-2121
 URL: https://issues.apache.org/jira/browse/TIKA-2121
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101

Reporter: Seva Alekseyev


When parsing the following valid PDF file:

https://dl.dropboxusercontent.com/u/92341073/Protheroe%20Clin%20Gastr%202009.pdf

the Tika parser throws a ClassCastException with the message 
"org.apache.pdfbox.cos.COSString cannot be cast to 
org.apache.pdfbox.cos.COSDictionary"





[jira] [Created] (TIKA-2119) ArrayIndexOutOfBoundsException on a Word document

2016-10-14 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2119:


 Summary: ArrayIndexOutOfBoundsException on a Word document
 Key: TIKA-2119
 URL: https://issues.apache.org/jira/browse/TIKA-2119
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the following valid Word document:

https://dl.dropboxusercontent.com/u/92341073/Message%20to%20Eric%20Spooner.doc

the Tika parser throws an ArrayIndexOutOfBoundsException.





[jira] [Created] (TIKA-2118) Misleading exception on a password protected XLS

2016-10-14 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2118:


 Summary: Misleading exception on a password protected XLS
 Key: TIKA-2118
 URL: https://issues.apache.org/jira/browse/TIKA-2118
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


When parsing the following password protected Excel file:

https://dl.dropboxusercontent.com/u/92341073/Copy%20of%20I-LHD%203E.xls

Tika emits an IllegalArgumentException with a message "Unsupported codepage 
requested". The inability to parse has nothing to do with codepage, that error 
is misleading.





[jira] [Created] (TIKA-2117) NullPointerException on PDF

2016-10-14 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2117:


 Summary: NullPointerException on PDF
 Key: TIKA-2117
 URL: https://issues.apache.org/jira/browse/TIKA-2117
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


Tika PDF parser emits a NullPointerException on the following PDF file:

https://dl.dropboxusercontent.com/u/92341073/TEST_THOR.PDF

The file displays as expected in Acrobat.





[jira] [Updated] (TIKA-2115) OOM caused by corrupt embedded OLE object

2016-10-14 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2115:
--
Summary: OOM caused by corrupt embedded OLE object  (was: OoM Crash)

> OOM caused by corrupt embedded OLE object
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Commented] (TIKA-2115) OoM Crash

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575600#comment-15575600
 ] 

Tim Allison commented on TIKA-2115:
---

This likely won't make it into Tika 1.14, but thank you for opening the issue 
and sharing a test file!

> OoM Crash
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Created] (TIKA-2116) Upgrade to POI 3.16-beta1 when available

2016-10-14 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2116:
-

 Summary: Upgrade to POI 3.16-beta1 when available
 Key: TIKA-2116
 URL: https://issues.apache.org/jira/browse/TIKA-2116
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor








[jira] [Commented] (TIKA-2115) OoM Crash

2016-10-14 Thread Thomas Galla (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575304#comment-15575304
 ] 

Thomas Galla commented on TIKA-2115:


Thank you Tim.


> OoM Crash
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Comment Edited] (TIKA-2115) OoM Crash

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575297#comment-15575297
 ] 

Tim Allison edited comment on TIKA-2115 at 10/14/16 1:15 PM:
-

Opened [Bug 60256|https://bz.apache.org/bugzilla/show_bug.cgi?id=60256] to 
track this.


was (Author: talli...@mitre.org):
Opened [Bug 60526|https://bz.apache.org/bugzilla/show_bug.cgi?id=60256] to 
track this.

> OoM Crash
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Commented] (TIKA-2115) OoM Crash

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575297#comment-15575297
 ] 

Tim Allison commented on TIKA-2115:
---

Opened [Bug 60526|https://bz.apache.org/bugzilla/show_bug.cgi?id=60256] to 
track this.

> OoM Crash
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Commented] (TIKA-2115) OoM Crash

2016-10-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575276#comment-15575276
 ] 

Tim Allison commented on TIKA-2115:
---

Thank you for opening this.  I think we'll have to fix this at the POI level, 
because at the Tika level,  I'm getting {{nativeEntry}}'s size as 4100 and 
{{part}}'s size as 7168.  Something appears to be going wrong in the 
calculation of {{dataSize}} in Ole10Native's initialization.
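
For reference, the kind of guard that would avoid the OOM can be sketched as follows: validate a declared size against the bytes actually available (and an upper bound) before allocating. This is only an illustration under assumptions; the 4-byte little-endian length prefix, the class name and the {{maxPayload}} parameter are hypothetical, and this is not Ole10Native's actual layout or code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not POI's Ole10Native implementation.
final class LengthPrefixedReader {

    /**
     * Reads a 4-byte little-endian length followed by that many payload bytes.
     * The declared length is checked against maxPayload and against the bytes
     * actually present before any buffer is allocated.
     */
    static byte[] readPayload(byte[] data, int maxPayload) throws IOException {
        if (data.length < 4) {
            throw new IOException("Truncated length field");
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int declared = buf.getInt();
        if (declared < 0 || declared > maxPayload || declared > buf.remaining()) {
            throw new IOException("Implausible payload size " + declared
                    + " with only " + buf.remaining() + " bytes available");
        }
        byte[] payload = new byte[declared]; // safe: bounded by the checks above
        buf.get(payload);
        return payload;
    }
}
```

A corrupt size field claiming roughly 2GB then produces an IOException rather than an attempt to allocate the buffer.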

> OoM Crash
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Updated] (TIKA-2115) OoM Crash

2016-10-14 Thread Thomas Galla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Galla updated TIKA-2115:
---
Attachment: TikaTestcase.pptx

This is the testcase document, basically a stripped version of a customer 
document leading to the mentioned problem.

> OoM Crash
> -
>
> Key: TIKA-2115
> URL: https://issues.apache.org/jira/browse/TIKA-2115
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment:  Generic test with tika-app-1.13.jar on test document
>Reporter: Thomas Galla
> Attachments: TikaTestcase.pptx
>
>
> There is a size field when parsing an embedded OLE object in a Powerpoint 
> presentation that says there are 2GB of data that needs to be read and the 
> code simply tries to allocate a buffer for that, which results in OOM.





[jira] [Created] (TIKA-2115) OoM Crash

2016-10-14 Thread Thomas Galla (JIRA)
Thomas Galla created TIKA-2115:
--

 Summary: OoM Crash
 Key: TIKA-2115
 URL: https://issues.apache.org/jira/browse/TIKA-2115
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
 Environment:  Generic test with tika-app-1.13.jar on test document
Reporter: Thomas Galla


There is a size field when parsing an embedded OLE object in a Powerpoint 
presentation that says there are 2GB of data that needs to be read and the code 
simply tries to allocate a buffer for that, which results in OOM.


