[jira] [Updated] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-19 Thread Pascal Essiembre (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal Essiembre updated TIKA-2219:
---
Description: 
Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
always detected instead.  While not tested, this likely affects other 
windows-125* encodings as well.

I tracked it down to a change in the 
{{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  It now always 
returns "ISO-8859-1", whereas before it was: 
{{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}

That condition has now been moved to the {{match(CharsetDetector det)}} method 
so that the returned {{CharsetMatch}} has the proper name.  The problem is that 
the {{CharsetDetector#detectAll()}} method overwrites the correct match with a 
new one that returns the value of {{#getName()}} from the 
{{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).

There might be legitimate reasons why the {{CharsetMatch}} instances in the 
{{detectAll()}} method are replaced with new ones, but changing the code in 
that method as follows appears to work for me:

// Remove this:
//CharsetMatch m = new CharsetMatch(this, csr, confidence);
//matches.add(m);

// Add this instead:
matches.add(charsetMatch);


  was:
Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
always detected instead.  While not tested, this likely affects other 
windows-125* encodings.

I tracked it down to a change in the 
{{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  It now always 
returns "ISO-8859-1", whereas before it was: 
{{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}

That condition has now been moved to the {{match(CharsetDetector det)}} method 
so that the returned {{CharsetMatch}} has the proper name.  The problem is that 
the {{CharsetDetector#detectAll()}} method overwrites the correct match with a 
new one that returns the value of {{#getName()}} from the 
{{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).

There might be legitimate reasons why the {{CharsetMatch}} instances in the 
{{detectAll()}} method are replaced with new ones, but changing the code in 
that method as follows appears to work for me:

// Remove this:
//CharsetMatch m = new CharsetMatch(this, csr, confidence);
//matches.add(m);

// Add this instead:
matches.add(charsetMatch);



> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  It now always 
> returns "ISO-8859-1", whereas before it was: 
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> That condition has now been moved to the {{match(CharsetDetector det)}} 
> method so that the returned {{CharsetMatch}} has the proper name.  The problem 
> is that the {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that returns the value of {{#getName()}} from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in the 
> {{detectAll()}} method are replaced with new ones, but changing the code in 
> that method as follows appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-19 Thread Pascal Essiembre (JIRA)
Pascal Essiembre created TIKA-2219:
--

 Summary: CharsetDetector no longer detects windows-1252 charset
 Key: TIKA-2219
 URL: https://issues.apache.org/jira/browse/TIKA-2219
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Any.
Reporter: Pascal Essiembre
Priority: Minor


Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
always detected instead.  While not tested, this likely affects other 
windows-125* encodings.

I tracked it down to a change in the 
{{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  It now always 
returns "ISO-8859-1", whereas before it was: 
{{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}

That condition has now been moved to the {{match(CharsetDetector det)}} method 
so that the returned {{CharsetMatch}} has the proper name.  The problem is that 
the {{CharsetDetector#detectAll()}} method overwrites the correct match with a 
new one that returns the value of {{#getName()}} from the 
{{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
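
As a rough illustration of the {{haveC1Bytes}} condition mentioned above, the 
sketch below assumes the flag simply means "the input contains at least one 
byte in the range 0x80-0x9F".  ISO-8859-1 reserves that range for C1 control 
characters, while windows-1252 uses most of it for printable characters such 
as curly quotes.  The class and method names are hypothetical and are not part 
of Tika:

```java
// Hypothetical helper, not a Tika class: models the assumed meaning of haveC1Bytes.
class C1ByteCheck {

    /** Returns true if any byte falls in 0x80..0x9F (the C1 range of ISO-8859-1). */
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF; // Java bytes are signed; mask to get 0..255
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the expression quoted in the report. */
    static String nameFor(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in windows-1252.
        byte[] curlyQuotes = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        // 0xE9 is e-acute in both encodings and lies outside the C1 range.
        byte[] latin1Only = {'c', 'a', 'f', (byte) 0xE9};

        System.out.println(nameFor(curlyQuotes)); // prints windows-1252
        System.out.println(nameFor(latin1Only));  // prints ISO-8859-1
    }
}
```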

There might be legitimate reasons why the {{CharsetMatch}} instances in the 
{{detectAll()}} method are replaced with new ones, but changing the code in 
that method as follows appears to work for me:

// Remove this:
//CharsetMatch m = new CharsetMatch(this, csr, confidence);
//matches.add(m);

// Add this instead:
matches.add(charsetMatch);
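
The difference between the two versions of the loop can be sketched with 
simplified stand-in types.  {{Recognizer}}, {{Match}} and both 
{{detectAll...}} methods below are hypothetical and only model the behavior 
described in this report; they are not the actual Tika classes:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the reported behavior (not Tika code).
class DetectAllSketch {

    interface Recognizer {
        String getName();            // fixed name of the recognizer
        Match match(byte[] input);   // may return a more specific name
    }

    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    static final class Latin1Recognizer implements Recognizer {
        public String getName() {
            return "ISO-8859-1";     // always the same, as in the report
        }

        public Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {
                    haveC1Bytes = true;
                }
            }
            // The arbitrary confidence of 50 is only for illustration.
            return new Match(haveC1Bytes ? "windows-1252" : "ISO-8859-1", 50);
        }
    }

    /** Rebuilds each match from the recognizer, losing the specific name. */
    static List<Match> detectAllRewrapping(List<Recognizer> recognizers, byte[] input) {
        List<Match> matches = new ArrayList<>();
        for (Recognizer csr : recognizers) {
            Match charsetMatch = csr.match(input);
            if (charsetMatch != null) {
                matches.add(new Match(csr.getName(), charsetMatch.confidence));
            }
        }
        return matches;
    }

    /** Keeps the match returned by the recognizer, as in the proposed change. */
    static List<Match> detectAllKeeping(List<Recognizer> recognizers, byte[] input) {
        List<Match> matches = new ArrayList<>();
        for (Recognizer csr : recognizers) {
            Match charsetMatch = csr.match(input);
            if (charsetMatch != null) {
                matches.add(charsetMatch);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Recognizer> recognizers = new ArrayList<>();
        recognizers.add(new Latin1Recognizer());
        byte[] input = {(byte) 0x93, 'o', 'k', (byte) 0x94};

        System.out.println(detectAllRewrapping(recognizers, input).get(0).name); // prints ISO-8859-1
        System.out.println(detectAllKeeping(recognizers, input).get(0).name);    // prints windows-1252
    }
}
```

In this model the only difference between the two methods is which object is 
added to the list, which is the same one-line change proposed above.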






[jira] [Commented] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment

2016-12-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762403#comment-15762403
 ] 

Hudson commented on TIKA-2218:
--

SUCCESS: Integrated in Jenkins build tika-2.x #182 (See 
[https://builds.apache.org/job/tika-2.x/182/])
 TIKA-2218 -- add a new new locations within a pptx to check for (tallison: rev 
4f04b6c3e9645bfe5fdb7d7f1078051c0eca7fcc)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java


> Add a few more places where PPTX relationships might include an attachment
> --
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Slide masters, the overall master, handout, notes and notes master can all 
> contain embedded objects.  Let's add those to the {{mainDocumentParts}} in 
> pptx.





[jira] [Commented] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment

2016-12-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762394#comment-15762394
 ] 

Hudson commented on TIKA-2218:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1161 (See 
[https://builds.apache.org/job/Tika-trunk/1161/])
TIKA-2218 -- add a few more places where .pptx can include embedded (tallison: 
rev ca37313a716d4eaa3a15a4ba770f89ee23832e99)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java


> Add a few more places where PPTX relationships might include an attachment
> --
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Slide masters, the overall master, handout, notes and notes master can all 
> contain embedded objects.  Let's add those to the {{mainDocumentParts}} in 
> pptx.





[jira] [Resolved] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment

2016-12-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2218.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> Add a few more places where PPTX relationships might include an attachment
> --
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Slide masters, the overall master, handout, notes and notes master can all 
> contain embedded objects.  Let's add those to the {{mainDocumentParts}} in 
> pptx.





[jira] [Commented] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment

2016-12-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762337#comment-15762337
 ] 

Hudson commented on TIKA-2218:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #83 (See 
[https://builds.apache.org/job/tika-2.x-windows/83/])
 TIKA-2218 -- add a new new locations within a pptx to check for (tallison: rev 
4f04b6c3e9645bfe5fdb7d7f1078051c0eca7fcc)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java


> Add a few more places where PPTX relationships might include an attachment
> --
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>
> Slide masters, the overall master, handout, notes and notes master can all 
> contain embedded objects.  Let's add those to the {{mainDocumentParts}} in 
> pptx.





tika-2.x-windows - Build # 83 - Still Failing

2016-12-19 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #83)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/83/ to 
view the results.

[jira] [Created] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment

2016-12-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2218:
-

 Summary: Add a few more places where PPTX relationships might 
include an attachment
 Key: TIKA-2218
 URL: https://issues.apache.org/jira/browse/TIKA-2218
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


Slide masters, the overall master, handout, notes and notes master can all 
contain embedded objects.  Let's add those to the {{mainDocumentParts}} in pptx.





[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2217:
-
Description: 
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.RuntimeException for : "Couldn't instantiate the 
class for type with id 1000 on class class org.apache.poi.hslf.record.Document 
: java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1"
java.lang.RuntimeException: Couldn't instantiate the class for type with id 
1000 on class class org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at 

[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2217:
-
Description: 
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.RuntimeException: Couldn't instantiate the class for type with id 
1000 on class class org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor47.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at 

[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2216:
-
Description: 
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130


> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2216:
-
Description: 
java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130


> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2217) RuntimeException on a PPT with a movie

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2217:


 Summary: RuntimeException on a PPT with a movie
 Key: TIKA-2217
 URL: https://issues.apache.org/jira/browse/TIKA-2217
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


java.lang.RuntimeException for 
63933/<\\ai-storm\FScan\Scan_2016-12-16_01-06-55\Folders\75457622\lecture WH 
2002.ppt>: "Couldn't instantiate the class for type with id 1000 on class class 
org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1"
java.lang.RuntimeException: Couldn't instantiate the class for type with id 
1000 on class class org.apache.poi.hslf.record.Document : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.reflect.InvocationTargetException: 
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>:161
at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>:154
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 1033 on class class org.apache.poi.hslf.record.ExObjList : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type 
with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : 
java.lang.reflect.InvocationTargetException
Cause was : java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.poi.hslf.record.Record.createRecordForType:185
at org.apache.poi.hslf.record.Record.findChildRecords:128
at org.apache.poi.hslf.record.Document.<init>:133
at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1
at java.lang.reflect.Constructor.newInstance:-1
at org.apache.poi.hslf.record.Record.createRecordForType:181
at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276
at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257
at 

[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2216:
-
Attachment: TB Coord RFCb.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
>
> java.lang.ArrayIndexOutOfBoundsException: 
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2216:


 Summary: ArrayIndexOutOfBoundsException on a valid Word file
 Key: TIKA-2216
 URL: https://issues.apache.org/jira/browse/TIKA-2216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


java.lang.ArrayIndexOutOfBoundsException: 
at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
at org.apache.poi.hwpf.HWPFOldDocument.<init>:132
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2215:


 Summary: TikaException about "Invalid embedded resource" on a 
valid PPT file
 Key: TIKA-2215
 URL: https://issues.apache.org/jira/browse/TIKA-2215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: Iverson.ppt

On the attached file, which opens with PowerPoint, the Tika parser throws the 
following error:

org.apache.tika.exception.TikaException: Invalid embedded resource
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
at org.apache.tika.parser.microsoft.OfficeParser.parse:172
at org.apache.tika.parser.microsoft.OfficeParser.parse:130
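
A note on the numbers in this trace, offered as a sketch rather than a statement about POI's internals: they are consistent with block n being read from byte offset (n + 1) * 512, that is, one 512-byte header block ahead of the data blocks, since (32630271 + 1) * 512 = 16706699264. The stream is only 164352 bytes long, so the block index itself looks corrupt or misread. The names below (BlockOffsets, blockOffset, isReadable) are hypothetical and only illustrate the arithmetic.

```java
// Illustrative only: models the offset arithmetic suggested by the trace.
// Assumption: block n starts at (n + 1) * blockSize because a header
// occupies the first block. These names are not part of the POI API.
class BlockOffsets {

    // Byte offset at which the given block would start.
    static long blockOffset(long blockIndex, int blockSize) {
        return (blockIndex + 1) * blockSize;
    }

    // True only if the whole block lies inside a stream of the given length.
    static boolean isReadable(long blockIndex, int blockSize, long streamLength) {
        long start = blockOffset(blockIndex, blockSize);
        return blockIndex >= 0 && start + blockSize <= streamLength;
    }
}
```

With the values from the report, blockOffset(32630271L, 512) gives 16706699264 and isReadable(32630271L, 512, 164352L) is false, which matches the "Unable to read 512 bytes" message.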





[jira] [Updated] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2215:
-
Attachment: Iverson.ppt

> TikaException about "Invalid embedded resource" on a valid PPT file
> ---
>
> Key: TIKA-2215
> URL: https://issues.apache.org/jira/browse/TIKA-2215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Iverson.ppt
>
>
> On the attached file, which opens with PowerPoint, the Tika parser throws the 
> following error:
> org.apache.tika.exception.TikaException: Invalid embedded resource
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>   at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
> Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
>   at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
>   at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
>   at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
>   at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
>   at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
>   at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
>   at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:172
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2214:


 Summary: ArrayIndexOutOfBoundsException on a valid Word file
 Key: TIKA-2214
 URL: https://issues.apache.org/jira/browse/TIKA-2214
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
 Attachments: NONCONT.DOC

On the attached file, which opens with Word, the Tika parser throws the 
following error:

java.lang.ArrayIndexOutOfBoundsException: 
at java.lang.System.arraycopy:-2
at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Updated] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2214:
-
Attachment: NONCONT.DOC

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2214
> URL: https://issues.apache.org/jira/browse/TIKA-2214
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: NONCONT.DOC
>
>
> On the attached file, which opens with Word, the Tika parser throws the 
> following error:
> java.lang.ArrayIndexOutOfBoundsException: 
>   at java.lang.System.arraycopy:-2
>   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
>   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
>   at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Updated] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2213:
-
Attachment: biennial - 96.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---
>
> Key: TIKA-2213
> URL: https://issues.apache.org/jira/browse/TIKA-2213
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: biennial - 96.doc
>
>
> On the attached file, which opens in Word, the Tika parser throws the 
> following error:
> java.lang.ArrayIndexOutOfBoundsException: 
>   at java.lang.System.arraycopy:-2
>   at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
>   at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Created] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file

2016-12-19 Thread Seva Alekseyev (JIRA)
Seva Alekseyev created TIKA-2213:


 Summary: ArrayIndexOutOfBoundsException on a valid Word file
 Key: TIKA-2213
 URL: https://issues.apache.org/jira/browse/TIKA-2213
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
 Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev


On the attached file, which opens in Word, the Tika parser throws the 
following error:

java.lang.ArrayIndexOutOfBoundsException: 
at java.lang.System.arraycopy:-2
at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
at org.apache.tika.parser.microsoft.WordExtractor.parse:153
at org.apache.tika.parser.microsoft.OfficeParser.parse:169
at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Resolved] (TIKA-2094) Error parsing .doc file with visio embed

2016-12-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2094.
---
Resolution: Not A Problem

Not a problem anymore

> Error parsing .doc file with visio embed
> 
>
> Key: TIKA-2094
> URL: https://issues.apache.org/jira/browse/TIKA-2094
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: JDK7
>Reporter: wangruochan
> Attachments: testtika.doc, testtika.doc
>
>
> When I try to parse a .doc file with an embedded Visio document, an exception 
> occurs. The stack trace is printed below:
> Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
>   at com.microsoft.schemas.office.visio.x2012.main.impl.PageContentsTypeImpl.getConnects(Unknown Source)
>   at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:89)
>   at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:73)
>   at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:94)
>   at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:108)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
>   at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
>   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:164)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:208)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at test.apache.tika.Test.main(Test.java:29)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.visio.x2012.main.ConnectsType
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 30 more
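
A NoClassDefFoundError caused by a ClassNotFoundException, as in the trace above, means the named class was not on the runtime classpath when it was first needed. A minimal sketch of a pre-flight check follows. ClasspathCheck and isClassAvailable are hypothetical names and are not part of Tika or POI.

```java
// Illustrative helper: reports whether a class can be located by the
// class loader that loaded this helper, without initializing the class.
class ClasspathCheck {

    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate and load the class, run no static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Calling isClassAvailable("com.microsoft.schemas.office.visio.x2012.main.ConnectsType") before parsing would show whether the schema class named in the trace can be loaded in a given deployment.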





[jira] [Resolved] (TIKA-2107) Old MS Word files give error while indexing

2016-12-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2107.
---
Resolution: Won't Fix

Sorry, POI/Tika don't currently support such old Word files.  If anyone 
contributes a parser to POI, we'll be sure to include it in Tika.  Closing as 
'won't fix' for now.  Please reopen if a Java, Apache 2.0-compatible parser is 
available.

> Old MS Word files give error while indexing
> ---
>
> Key: TIKA-2107
> URL: https://issues.apache.org/jira/browse/TIKA-2107
> Project: Tika
>  Issue Type: Bug
>  Components: tika-batch
>Affects Versions: 1.8, 2.0
> Environment: ubuntu
>Reporter: Gaurav
>  Labels: patch
> Attachments: Tika 2.0 error.jpg, plen281.doc
>
>
> Error while indexing old MS Word files.
> A screenshot of the Tika 2.0 error is attached. 
> Log of the error with Tika 1.8:
> INFO: meta (application/msword)
> Oct 04, 2016 6:42:30 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: meta: Text extraction failed
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@7260e439
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:238)
>   at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:134)
>   at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:67)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>   at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>   at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 

[jira] [Updated] (TIKA-2212) Update mimes for OOXMLParser

2016-12-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2212:
--
Description: 
On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in 
our OOXMLParser.  Let's add it.

I also found that it was not possible to exclude children or grandchildren of 
"x-tika-ooxml".  We should fix that somehow.

  was:On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm 
files in our OOXMLParser.  Let's add it.


> Update mimes for OOXMLParser
> 
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files 
> in our OOXMLParser.  Let's add it.
> I also found that it was not possible to exclude children or grandchildren of 
> "x-tika-ooxml".  We should fix that somehow.





[jira] [Commented] (TIKA-2212) Update mimes for OOXMLParser

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761132#comment-15761132
 ] 

Tim Allison commented on TIKA-2212:
---

So that users can control includes/excludes with greater precision, I propose 
adding:

{noformat}
application/vnd.ms-powerpoint.slide.macroenabled.12
application/vnd.ms-powerpoint.template.macroenabled.12
application/vnd.openxmlformats-officedocument.presentationml.slide
application/vnd.ms-visio.drawing
application/vnd.ms-visio.drawing.macroenabled.12
application/vnd.ms-visio.stencil
application/vnd.ms-visio.stencil.macroenabled.12
application/vnd.ms-visio.template
application/vnd.ms-visio.template.macroenabled.12
model/vnd.dwfx+xps
{noformat}

and removing {{x-tika-ooxml}} from OOXMLParser's {{SUPPORTED_TYPES}}.

Some questions:
1) Does the OOXMLParser actually support all of these or should some be moved 
to the {{UNSUPPORTED_OOXML_TYPES}}?
2) Any objections to this proposal? (ping [~gagravarr])

> Update mimes for OOXMLParser
> 
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files 
> in our OOXMLParser.  Let's add it.
> I also found that it was not possible to exclude children or grandchildren of 
> "x-tika-ooxml".  We should fix that somehow.





[jira] [Comment Edited] (TIKA-2212) Update mimes for OOXMLParser

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761116#comment-15761116
 ] 

Tim Allison edited comment on TIKA-2212 at 12/19/16 1:15 PM:
-

If we run this:
{noformat}
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();

MediaTypeRegistry registry = tikaConfig.getMediaTypeRegistry();
for (MediaType child : registry.getChildTypes(MediaType.application("x-tika-ooxml"))) {
    //System.out.println(child);
    if (! OOXMLParser.SUPPORTED_TYPES.contains(child) &&
            ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
        System.out.println("Falling between the cracks: " + child);
    }
    for (MediaType grandchild : registry.getChildTypes(child)) {
        if (! OOXMLParser.SUPPORTED_TYPES.contains(grandchild) &&
                ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(grandchild)) {
            System.out.println("Falling between the cracks grandchild: " + grandchild);
        }
    }
}
{noformat}

We get this:
{noformat}
Falling between the cracks: application/vnd.ms-powerpoint.slide.macroenabled.12
Falling between the cracks: application/vnd.ms-powerpoint.template.macroenabled.12
Falling between the cracks: application/vnd.openxmlformats-officedocument.presentationml.slide
Falling between the cracks: application/x-tika-ooxml-protected
Falling between the cracks: application/x-tika-visio-ooxml
Falling between the cracks grandchild: application/vnd.ms-visio.drawing
Falling between the cracks grandchild: application/vnd.ms-visio.drawing.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.stencil
Falling between the cracks grandchild: application/vnd.ms-visio.stencil.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.template
Falling between the cracks grandchild: application/vnd.ms-visio.template.macroenabled.12
Falling between the cracks: model/vnd.dwfx+xps
{noformat}
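
The check above looks at children and grandchildren only. As a sketch of how the same idea extends to any depth, the following walks the whole subtree recursively. It is a simplified stand-in, not Tika code: plain strings and a Map replace MediaType and MediaTypeRegistry, and the names UncoveredTypes, find and collect are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model (an assumption for illustration): the type hierarchy is a
// Map from parent type name to its direct child type names.
class UncoveredTypes {

    // Returns every descendant of root, at any depth, that is in neither set.
    static List<String> find(Map<String, List<String>> children, String root,
                             Set<String> supported, Set<String> unsupported) {
        List<String> result = new ArrayList<>();
        collect(children, root, supported, unsupported, new HashSet<>(), result);
        return result;
    }

    private static void collect(Map<String, List<String>> children, String parent,
                                Set<String> supported, Set<String> unsupported,
                                Set<String> visited, List<String> result) {
        for (String child : children.getOrDefault(parent, Collections.emptyList())) {
            if (!visited.add(child)) {
                continue; // already seen; guards against cycles in the map
            }
            if (!supported.contains(child) && !unsupported.contains(child)) {
                result.add(child);
            }
            // Recurse so that grandchildren and deeper levels are checked too.
            collect(children, child, supported, unsupported, visited, result);
        }
    }
}
```

Results come back in depth-first order, so each uncovered parent is listed before its uncovered descendants.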


was (Author: talli...@mitre.org):
If we run this:
{noformat}
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();

MediaTypeRegistry registry = tikaConfig.getMediaTypeRegistry();
for (MediaType child : registry.getChildTypes(MediaType.application("x-tika-ooxml"))) {
    //System.out.println(child);
    if (! OOXMLParser.SUPPORTED_TYPES.contains(child) &&
            ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
        System.out.println("Falling between the cracks: "+child);
    }
    for (MediaType grandchild : registry.getChildTypes(child)) {
        if (! OOXMLParser.SUPPORTED_TYPES.contains(child) &&
                ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
            System.out.println("Falling between the cracks grandchild: "+grandchild);
        }
    }
}
{noformat}

We get this:
{noformat}
Falling between the cracks: application/vnd.ms-powerpoint.slide.macroenabled.12
Falling between the cracks: application/vnd.ms-powerpoint.template.macroenabled.12
Falling between the cracks: application/vnd.openxmlformats-officedocument.presentationml.slide
Falling between the cracks: application/x-tika-ooxml-protected
Falling between the cracks: application/x-tika-visio-ooxml
Falling between the cracks grandchild: application/vnd.ms-visio.drawing
Falling between the cracks grandchild: application/vnd.ms-visio.drawing.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.stencil
Falling between the cracks grandchild: application/vnd.ms-visio.stencil.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.template
Falling between the cracks grandchild: application/vnd.ms-visio.template.macroenabled.12
Falling between the cracks: model/vnd.dwfx+xps
{noformat}

> Update mimes for OOXMLParser
> 
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files 
> in our OOXMLParser.  Let's add it.
> I also found that it was not possible to exclude children or grandchildren of 
> "x-tika-ooxml".  We should fix that somehow.





[jira] [Updated] (TIKA-2212) Update mimes for OOXMLParser

2016-12-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2212:
--
Summary: Update mimes for OOXMLParser  (was: Add mime for .potm to 
OOXMLParser)

> Update mimes for OOXMLParser
> 
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files 
> in our OOXMLParser.  Let's add it.





[jira] [Commented] (TIKA-2212) Add mime for .potm to OOXMLParser

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761116#comment-15761116
 ] 

Tim Allison commented on TIKA-2212:
---

If we run this:
{noformat}
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();

MediaTypeRegistry registry = tikaConfig.getMediaTypeRegistry();
for (MediaType child : registry.getChildTypes(MediaType.application("x-tika-ooxml"))) {
    //System.out.println(child);
    if (! OOXMLParser.SUPPORTED_TYPES.contains(child) &&
            ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
        System.out.println("Falling between the cracks: "+child);
    }
    for (MediaType grandchild : registry.getChildTypes(child)) {
        if (! OOXMLParser.SUPPORTED_TYPES.contains(child) &&
                ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
            System.out.println("Falling between the cracks grandchild: "+grandchild);
        }
    }
}
{noformat}

We get this:
{noformat}
Falling between the cracks: application/vnd.ms-powerpoint.slide.macroenabled.12
Falling between the cracks: application/vnd.ms-powerpoint.template.macroenabled.12
Falling between the cracks: application/vnd.openxmlformats-officedocument.presentationml.slide
Falling between the cracks: application/x-tika-ooxml-protected
Falling between the cracks: application/x-tika-visio-ooxml
Falling between the cracks grandchild: application/vnd.ms-visio.drawing
Falling between the cracks grandchild: application/vnd.ms-visio.drawing.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.stencil
Falling between the cracks grandchild: application/vnd.ms-visio.stencil.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.template
Falling between the cracks grandchild: application/vnd.ms-visio.template.macroenabled.12
Falling between the cracks: model/vnd.dwfx+xps
{noformat}

> Add mime for .potm to OOXMLParser
> -
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files 
> in our OOXMLParser.  Let's add it.





[jira] [Created] (TIKA-2212) Add mime for .potm to OOXMLParser

2016-12-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2212:
-

 Summary: Add mime for .potm to OOXMLParser
 Key: TIKA-2212
 URL: https://issues.apache.org/jira/browse/TIKA-2212
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Priority: Trivial


On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in 
our OOXMLParser.  Let's add it.





[jira] [Comment Edited] (TIKA-2208) Catch missing libraries

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761055#comment-15761055
 ] 

Tim Allison edited comment on TIKA-2208 at 12/19/16 12:54 PM:
--

Three cheers for unit tests!

It looks like we need to add vnd.ms-powerpoint.template.macroenabled.12 to 
OOXMLParser's handled media types.  I'll make that change shortly.

Meanwhile, you could try something like this, which runs against nearly all of 
our test documents:

{noformat}
// Types the decorated OOXMLParser should handle: everything it already
// supports except the generic x-tika-ooxml parent, plus the missing .potm type.
private static final Set<MediaType> INCLUDES = new HashSet<>();

static {
    for (MediaType mediaType : OOXMLParser.SUPPORTED_TYPES) {
        if (mediaType.equals(MediaType.application("x-tika-ooxml"))) {
            continue;
        }
        INCLUDES.add(mediaType);
    }
    INCLUDES.add(MediaType.application("vnd.ms-powerpoint.template.macroenabled.12"));
}

private static final Set<MediaType> EXCLUDES =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
                MediaType.application("x-tika-ooxml")
        )));

private static final Parser[] DECORATED_PARSERS = new Parser[] {
        // documents
        new org.apache.tika.parser.html.HtmlParser(),
        new org.apache.tika.parser.rtf.RTFParser(),
        new org.apache.tika.parser.pdf.PDFParser(),
        new org.apache.tika.parser.txt.TXTParser(),
        new org.apache.tika.parser.microsoft.OfficeParser(),
        new org.apache.tika.parser.microsoft.OldExcelParser(),
        ParserDecorator.withTypes(
                ParserDecorator.withoutTypes(
                        new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES
                ), INCLUDES),
        new org.apache.tika.parser.odf.OpenDocumentParser(),
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
        new org.apache.tika.parser.xml.DcXMLParser(),
        new org.apache.tika.parser.epub.EpubParser(),
};

private static final Parser[] STANDARD_PARSERS = new Parser[] {
        // documents
        new org.apache.tika.parser.html.HtmlParser(),
        new org.apache.tika.parser.rtf.RTFParser(),
        new org.apache.tika.parser.pdf.PDFParser(),
        new org.apache.tika.parser.txt.TXTParser(),
        new org.apache.tika.parser.microsoft.OfficeParser(),
        new org.apache.tika.parser.microsoft.OldExcelParser(),
        new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
        new org.apache.tika.parser.odf.OpenDocumentParser(),
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
        new org.apache.tika.parser.xml.DcXMLParser(),
        new org.apache.tika.parser.epub.EpubParser(),
};

private static final AutoDetectParser DECORATED_PARSER_INSTANCE =
        new AutoDetectParser(DECORATED_PARSERS);
private static final AutoDetectParser STANDARD_PARSER_INSTANCE =
        new AutoDetectParser(STANDARD_PARSERS);

private static final Tika DECORATED_TIKA =
        new Tika(DECORATED_PARSER_INSTANCE.getDetector(), DECORATED_PARSER_INSTANCE);
private static final Tika STANDARD_TIKA =
        new Tika(STANDARD_PARSER_INSTANCE.getDetector(), STANDARD_PARSER_INSTANCE);

@Test
public void testSkipVisioOOXML() throws Exception {

for (File f : getResourceAsFile("/test-documents").listFiles()) {
if (f.isDirectory()) {
continue;
}

if (f.getName().contains("VISIO") && (f.getName().endsWith("x") || 
f.getName().endsWith("m"))) {
continue;
}

if (f.getName().contains("embeddedVsdx")) {
continue;
}


boolean decoratedEx = false;
boolean standardEx = false;
String decoratedOutput = "";
String standardOutput = "";

try (InputStream is = TikaInputStream.get(f)) {
decoratedOutput = DECORATED_TIKA.parseToString(is);
} catch (Throwable e) {
decoratedEx = true;
}
try (InputStream is = TikaInputStream.get(f)) {
standardOutput = STANDARD_TIKA.parseToString(is);
} catch (Throwable e) {
standardEx = true;
}
assertEquals(f.getName(), standardEx, decoratedEx);

if (standardEx == false) {
assertEquals(f.getName(), standardOutput, decoratedOutput);
}
}

}
{noformat}



[jira] [Commented] (TIKA-2208) Catch missing libraries

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761085#comment-15761085
 ] 

Tim Allison commented on TIKA-2208:
---

Ugh.  Rather than including the two clashing subsets of the ooxml-schemas, you 
could include the full 
[ooxml-schemas|https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas/1.3].
That does weigh in at 15MB, but it includes everything.

> Catch missing libraries
> ---
>
> Key: TIKA-2208
> URL: https://issues.apache.org/jira/browse/TIKA-2208
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract 
> text and metadata.
> We defined our list of Parsers:
> {code:java}
> private static final Parser PARSERS[] = new Parser[] {
>     // documents
>     new org.apache.tika.parser.html.HtmlParser(),
>     new org.apache.tika.parser.rtf.RTFParser(),
>     new org.apache.tika.parser.pdf.PDFParser(),
>     new org.apache.tika.parser.txt.TXTParser(),
>     new org.apache.tika.parser.microsoft.OfficeParser(),
>     new org.apache.tika.parser.microsoft.OldExcelParser(),
>     new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>     new org.apache.tika.parser.odf.OpenDocumentParser(),
>     new org.apache.tika.parser.iwork.IWorkPackageParser(),
>     new org.apache.tika.parser.xml.DcXMLParser(),
>     new org.apache.tika.parser.epub.EpubParser(),
> };
> private static final AutoDetectParser PARSER_INSTANCE =
>     new AutoDetectParser(PARSERS);
> private static final Tika TIKA_INSTANCE =
>     new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds another unsupported document 
> (such as a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch such a case and throw a {{TikaException}} 
> instead, so it behaves as an Exception and not as a Throwable?
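
For illustration only, the request above could be sketched as a small wrapper that catches the JVM's {{NoClassDefFoundError}} and rethrows it as a checked exception. This is not Tika code: {{ParseFailedException}} is a hypothetical stand-in for {{TikaException}}, and {{LinkageGuard}} is an invented helper name. The sketch assumes that only missing-class failures should be converted and that other {{Error}}s should still propagate.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a checked parse exception such as TikaException. */
class ParseFailedException extends Exception {
    ParseFailedException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical helper; the name and shape are assumptions, not a Tika API. */
final class LinkageGuard {
    private LinkageGuard() {}

    /**
     * Runs the task and converts a missing-class failure into a checked
     * exception. Other Errors (for example OutOfMemoryError) are left alone.
     */
    static <T> T call(Callable<T> task) throws ParseFailedException {
        try {
            return task.call();
        } catch (NoClassDefFoundError e) {
            throw new ParseFailedException(
                    "Missing optional dependency: " + e.getMessage(), e);
        } catch (ParseFailedException e) {
            throw e; // already the right type, do not double-wrap
        } catch (Exception e) {
            throw new ParseFailedException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            LinkageGuard.call(() -> {
                // Simulates what the JVM throws when a parser's library is absent.
                throw new NoClassDefFoundError("com/example/MissingSchemaClass");
            });
        } catch (ParseFailedException e) {
            System.out.println("Caught as Exception: " + e.getMessage());
        }
    }
}
```

A caller that only has {{catch (Exception e)}} blocks would then see the failure, with the original error still available through {{getCause()}}.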





[jira] [Commented] (TIKA-2208) Catch missing libraries

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761071#comment-15761071
 ] 

Tim Allison commented on TIKA-2208:
---

Note, too, that the test also passes if you add:

{noformat}
if (f.getName().equals("testPPT.potm")) {
    assertContains("Watershed", decoratedOutput);
}
{noformat}


:)


[jira] [Commented] (TIKA-2211) ePub formatting instructions appear in plain text output

2016-12-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15760978#comment-15760978
 ] 

Tim Allison commented on TIKA-2211:
---

The ePub parser is using a straight SAXParser with no modifications.  Looks 
like we should modify it slightly to ignore <style> sections?

{noformat}

  
  
/**/
  

{noformat}
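
As a rough sketch of the idea (not the actual EpubParser change), a SAX handler can track whether it is inside a style element and drop character data while it is. The class name {{StyleSkippingTextHandler}} and the {{extractText}} helper are invented for this example; it uses only the JDK's SAX classes and assumes the input is well-formed XHTML without an external DTD.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data but drops anything nested inside a style element. */
final class StyleSkippingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int styleDepth = 0; // > 0 while inside <style>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isStyle(localName, qName)) {
            styleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isStyle(localName, qName)) {
            styleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // CSS rules (including CDATA-wrapped ones) arrive here too; skip them.
        if (styleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isStyle(String localName, String qName) {
        return "style".equalsIgnoreCase(localName) || "style".equalsIgnoreCase(qName);
    }

    String text() {
        return text.toString();
    }

    /** Hypothetical convenience method: parses the markup and returns its text. */
    static String extractText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            StyleSkippingTextHandler handler = new StyleSkippingTextHandler();
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
            return handler.text();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><style>p.sgc-1 {text-align: justify;}</style></head>"
                + "<body><p>Al duque</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

A depth counter is used rather than a boolean so that the handler stays correct even if style elements were ever nested.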

> ePub formatting instructions appear in plain text output
> 
>
> Key: TIKA-2211
> URL: https://issues.apache.org/jira/browse/TIKA-2211
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.14
> Environment: I tested this on Mac OS X 10.11.6 with Oracle JDK 
> 1.8.0_112.  The Tika stand-alone application was launched as follows:
> {code}
> java -jar tika-app-1.14.jar
> {code}
>Reporter: Adam Carroll
>
> For some ePub files, format information appears in the plain text output 
> produced by Apache Tika.  For example the Tika stand-alone application shows 
> the following text for the file “Don Quijote de la Mancha - Miguel de 
> Cervantes.epub” (downloaded 
> [here|http://www.literanda.com/don-quijote-de-la-mancha--miguel-de-cervantes--epub]):
> {code}
> /**/
>   p.sgc-2 {font-style: italic; text-align: right}
>   p.sgc-1 {text-align: justify;}
>   h3.sgc-3 {text-align: center;}
>   /**/
> Al duque de Béjar
> Marqués de Gibraleón, conde de Benalcázar y Bañares, vizconde de La Puebla de 
> Alcocer, señor de las villas de Capilla, Curiel y Burguillos
> En fe del buen acogimiento y honra que hace Vuestra Excelencia a toda suerte 
> de libros, como príncipe tan inclinado a favorecer las buenas artes, 
> mayormente las que por su nobleza no se abaten al servicio y granjerías del 
> vulgo, he determinado de sacar a luz El ingenioso hidalgo don Quijote de la 
> Mancha, al abrigo del clarísimo nombre de Vuestra Excelencia, a quien, con el 
> acatamiento que debo a tanta grandeza, suplico le reciba agradablemente en su 
> protección, para que a su sombra, aunque desnudo de aquel precioso ornamento 
> de elegancia y erudición de que suelen andar vestidas las obras que se 
> componen en las casas de los hombres que saben, ose parecer seguramente en el 
> juicio de algunos que, conteniéndose en los límites de su ignorancia, suelen 
> condenar con más rigor y menos justicia los trabajos ajenos; que, poniendo 
> los ojos la prudencia de Vuestra Excelencia en mi buen deseo, fío que no 
> desdeñará la cortedad de tan humilde servicio.
> {code}
> To reproduce this problem run the stand-alone version of Tika and open an 
> affected ePub file such as the one mentioned above.  Then go to View -> Plain 
> Text.  You should see the problem there.
> By the way, thanks for making Apache Tika a really useful library.  Keep up 
> the good work!


