[jira] [Updated] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pascal Essiembre updated TIKA-2219:
-----------------------------------
    Description:
Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is always detected instead. While not tested, this likely affects other windows-125* encodings as well.

I tracked it down to a change in the {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}

Now that condition has been moved to the {{match(CharsetDetector det)}} method so that the returned CharsetMatch has the proper name. The problem with that is {{CharsetDetector#detectAll()}} method overwrites the correct match with a new one that will return the value of {{#getName()}} from the {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).

There might be legitimate reasons why the {{CharsetMatch}} instances in {{detectAll()}} method are replaced with new ones, but changing this code in that method appears to work for me:

    // Remove this:
    //CharsetMatch m = new CharsetMatch(this, csr, confidence);
    //matches.add(m);

    // Add this instead:
    matches.add(charsetMatch);

  was:
Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is always detected instead. While not tested, this likely affects other windows-125* encodings.

I tracked it down to a change in the {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}

Now that condition has been moved to the {{match(CharsetDetector det)}} method so that the returned CharsetMatch has the proper name. The problem with that is {{CharsetDetector#detectAll()}} method overwrites the correct match with a new one that will return the value of {{#getName()}} from the {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).

There might be legitimate reasons why the {{CharsetMatch}} instances in {{detectAll()}} method are replaced with new ones, but changing this code in that method appears to work for me:

    // Remove this:
    //CharsetMatch m = new CharsetMatch(this, csr, confidence);
    //matches.add(m);

    // Add this instead:
    matches.add(charsetMatch);

> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
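The failure mode described in this report can be shown in miniature. The sketch below is not Tika code: `CharsetNameSketch`, `Recognizer`, `Match`, `detectAllRebuilding` and `detectAllKeeping` are hypothetical stand-ins, and the confidence value is arbitrary. It assumes only what the report states, namely that `match()` chooses the name from the C1 bytes while the recognizer's `getName()` always returns "ISO-8859-1".

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the classes named in the report; NOT the Tika sources.
class CharsetNameSketch {

    /** Hypothetical match: carries its own name, or falls back to the recognizer's. */
    static final class Match {
        private final Recognizer recognizer;
        private final int confidence;
        private final String name; // null means "ask the recognizer"

        Match(Recognizer recognizer, int confidence, String name) {
            this.recognizer = recognizer;
            this.confidence = confidence;
            this.name = name;
        }

        String getName() {
            return name != null ? name : recognizer.getName();
        }

        int getConfidence() {
            return confidence;
        }
    }

    /** Hypothetical recognizer mirroring the behaviour described for CharsetRecog_8859_1. */
    static final class Recognizer {
        String getName() {
            return "ISO-8859-1"; // always the same, as in the report
        }

        Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) { // C1 range: printable in windows-1252 only
                    haveC1Bytes = true;
                    break;
                }
            }
            String name = haveC1Bytes ? "windows-1252" : "ISO-8859-1";
            return new Match(this, 50, name); // 50 is an arbitrary confidence
        }
    }

    /** Rebuilds the match from the recognizer, so the name chosen in match() is lost. */
    static List<Match> detectAllRebuilding(Recognizer csr, byte[] input) {
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        matches.add(new Match(csr, charsetMatch.getConfidence(), null));
        return matches;
    }

    /** Keeps the match returned by the recognizer, as the report proposes. */
    static List<Match> detectAllKeeping(Recognizer csr, byte[] input) {
        List<Match> matches = new ArrayList<>();
        matches.add(csr.match(input));
        return matches;
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252.
        byte[] text = {'a', (byte) 0x93, 'b', (byte) 0x94};
        System.out.println(detectAllRebuilding(new Recognizer(), text).get(0).getName()); // ISO-8859-1
        System.out.println(detectAllKeeping(new Recognizer(), text).get(0).getName());    // windows-1252
    }
}
```

Whether the real `detectAll()` has other reasons to rebuild its matches is not something this sketch can answer; it only shows why a rebuilt match reports the recognizer's fixed name.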
[jira] [Created] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
Pascal Essiembre created TIKA-2219:
--------------------------------------

Summary: CharsetDetector no longer detects windows-1252 charset
Key: TIKA-2219
URL: https://issues.apache.org/jira/browse/TIKA-2219
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Any.
Reporter: Pascal Essiembre
Priority: Minor

Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is always detected instead. While not tested, this likely affects other windows-125* encodings.

I tracked it down to a change in the {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}

Now that condition has been moved to the {{match(CharsetDetector det)}} method so that the returned CharsetMatch has the proper name. The problem with that is {{CharsetDetector#detectAll()}} method overwrites the correct match with a new one that will return the value of {{#getName()}} from the {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).

There might be legitimate reasons why the {{CharsetMatch}} instances in {{detectAll()}} method are replaced with new ones, but changing this code in that method appears to work for me:

    // Remove this:
    //CharsetMatch m = new CharsetMatch(this, csr, confidence);
    //matches.add(m);

    // Add this instead:
    matches.add(charsetMatch);
[jira] [Commented] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment
[ https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762403#comment-15762403 ]

Hudson commented on TIKA-2218:
------------------------------

SUCCESS: Integrated in Jenkins build tika-2.x #182 (See [https://builds.apache.org/job/tika-2.x/182/])
TIKA-2218 -- add a new new locations within a pptx to check for (tallison: rev 4f04b6c3e9645bfe5fdb7d7f1078051c0eca7fcc)
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java

> Add a few more places where PPTX relationships might include an attachment
> --------------------------------------------------------------------------
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
> Slide masters, the overall master, handout, notes and notes master can all
> contain embedded objects. Let's add those to the {{mainDocumentParts}} in
> pptx.
[jira] [Commented] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment
[ https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762394#comment-15762394 ]

Hudson commented on TIKA-2218:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1161 (See [https://builds.apache.org/job/Tika-trunk/1161/])
TIKA-2218 -- add a few more places where .pptx can include embedded (tallison: rev ca37313a716d4eaa3a15a4ba770f89ee23832e99)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java

> Add a few more places where PPTX relationships might include an attachment
> --------------------------------------------------------------------------
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
> Slide masters, the overall master, handout, notes and notes master can all
> contain embedded objects. Let's add those to the {{mainDocumentParts}} in
> pptx.
[jira] [Resolved] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment
[ https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2218.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 1.15, 2.0

> Add a few more places where PPTX relationships might include an attachment
> --------------------------------------------------------------------------
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.15
>
> Slide masters, the overall master, handout, notes and notes master can all
> contain embedded objects. Let's add those to the {{mainDocumentParts}} in
> pptx.
[jira] [Commented] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment
[ https://issues.apache.org/jira/browse/TIKA-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762337#comment-15762337 ]

Hudson commented on TIKA-2218:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #83 (See [https://builds.apache.org/job/tika-2.x-windows/83/])
TIKA-2218 -- add a new new locations within a pptx to check for (tallison: rev 4f04b6c3e9645bfe5fdb7d7f1078051c0eca7fcc)
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java

> Add a few more places where PPTX relationships might include an attachment
> --------------------------------------------------------------------------
>
> Key: TIKA-2218
> URL: https://issues.apache.org/jira/browse/TIKA-2218
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>
> Slide masters, the overall master, handout, notes and notes master can all
> contain embedded objects. Let's add those to the {{mainDocumentParts}} in
> pptx.
tika-2.x-windows - Build # 83 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #83) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/83/ to view the results.
[jira] [Created] (TIKA-2218) Add a few more places where PPTX relationships might include an attachment
Tim Allison created TIKA-2218:
---------------------------------

Summary: Add a few more places where PPTX relationships might include an attachment
Key: TIKA-2218
URL: https://issues.apache.org/jira/browse/TIKA-2218
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor

Slide masters, the overall master, handout, notes and notes master can all contain embedded objects. Let's add those to the {{mainDocumentParts}} in pptx.
[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie
[ https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2217: - Description: https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt java.lang.RuntimeException for : "Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1" java.lang.RuntimeException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.poi.hslf.record.Record.createRecordForType:185 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at 
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.reflect.InvocationTargetException: at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.poi.hslf.record.Record.createRecordForType:185 at org.apache.poi.hslf.record.Record.findChildRecords:128 at org.apache.poi.hslf.record.Document.:133 at 
sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at
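The nested "Cause was" messages in the trace above wrap a single underlying failure several levels deep. When triaging a chain like this, it can help to pull out the innermost cause directly. The following is a minimal sketch: `RootCauseSketch` and `rootCause` are hypothetical helper names, and the exception chain in `main` is built by hand to imitate the shape of the trace; it is not produced by POI.

```java
import java.lang.reflect.InvocationTargetException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical triage helper; not part of Tika or POI.
class RootCauseSketch {

    /** Follows getCause() to the innermost throwable, guarding against cause cycles. */
    static Throwable rootCause(Throwable t) {
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built chain imitating the report: RuntimeException wrapping
        // InvocationTargetException, repeated, ending in the array index failure.
        Throwable root = new ArrayIndexOutOfBoundsException("1");
        Throwable inner = new RuntimeException(
                "Couldn't instantiate the class for type with id 4101",
                new InvocationTargetException(root));
        Throwable outer = new RuntimeException(
                "Couldn't instantiate the class for type with id 1000",
                new InvocationTargetException(inner));

        System.out.println(rootCause(outer)); // java.lang.ArrayIndexOutOfBoundsException: 1
    }
}
```

For this report, the innermost cause is the `ArrayIndexOutOfBoundsException: 1` raised while the `ExVideoContainer` record was being constructed.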
[jira] [Updated] (TIKA-2217) RuntimeException on a PPT with a movie
[ https://issues.apache.org/jira/browse/TIKA-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2217: - Description: https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt java.lang.RuntimeException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.poi.hslf.record.Record.createRecordForType:185 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.reflect.InvocationTargetException: at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at 
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.poi.hslf.record.Record.createRecordForType:185 at org.apache.poi.hslf.record.Record.findChildRecords:128 at org.apache.poi.hslf.record.Document.:133 at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.reflect.InvocationTargetException: at sun.reflect.GeneratedConstructorAccessor47.newInstance:-1 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.findChildRecords:128 at org.apache.poi.hslf.record.Document.:133 at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at
[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2216:
---------------------------------
    Description:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException:
  at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
  at org.apache.poi.hwpf.HWPFOldDocument.:132
  at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
  at org.apache.tika.parser.microsoft.WordExtractor.parse:153
  at org.apache.tika.parser.microsoft.OfficeParser.parse:169
  at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
java.lang.ArrayIndexOutOfBoundsException:
  at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
  at org.apache.poi.hwpf.HWPFOldDocument.:132
  at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
  at org.apache.tika.parser.microsoft.WordExtractor.parse:153
  at org.apache.tika.parser.microsoft.OfficeParser.parse:169
  at org.apache.tika.parser.microsoft.OfficeParser.parse:130

> ArrayIndexOutOfBoundsException on a valid Word file
> ---------------------------------------------------
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2216:
---------------------------------
    Description:
java.lang.ArrayIndexOutOfBoundsException:
  at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
  at org.apache.poi.hwpf.HWPFOldDocument.:132
  at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
  at org.apache.tika.parser.microsoft.WordExtractor.parse:153
  at org.apache.tika.parser.microsoft.OfficeParser.parse:169
  at org.apache.tika.parser.microsoft.OfficeParser.parse:130

  was:
https://dl.dropboxusercontent.com/u/92341073/lecture%20WH%202002.ppt

java.lang.ArrayIndexOutOfBoundsException:
  at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
  at org.apache.poi.hwpf.HWPFOldDocument.:132
  at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
  at org.apache.tika.parser.microsoft.WordExtractor.parse:153
  at org.apache.tika.parser.microsoft.OfficeParser.parse:169
  at org.apache.tika.parser.microsoft.OfficeParser.parse:130

> ArrayIndexOutOfBoundsException on a valid Word file
> ---------------------------------------------------
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
[jira] [Created] (TIKA-2217) RuntimeException on a PPT with a movie
Seva Alekseyev created TIKA-2217: Summary: RuntimeException on a PPT with a movie Key: TIKA-2217 URL: https://issues.apache.org/jira/browse/TIKA-2217 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.14 Environment: Windows 7 x64, JVM 1.8.0_101 Reporter: Seva Alekseyev java.lang.RuntimeException for 63933/<\\ai-storm\FScan\Scan_2016-12-16_01-06-55\Folders\75457622\lecture WH 2002.ppt>: "Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1" java.lang.RuntimeException: Couldn't instantiate the class for type with id 1000 on class class org.apache.poi.hslf.record.Document : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : 
java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.poi.hslf.record.Record.createRecordForType:185 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.reflect.InvocationTargetException: at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.:161 at org.apache.poi.hslf.usermodel.HSLFSlideShow.:154 at org.apache.tika.parser.microsoft.HSLFExtractor.parse:65 at org.apache.tika.parser.microsoft.OfficeParser.parse:172 at org.apache.tika.parser.microsoft.OfficeParser.parse:130 Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 1033 on class class org.apache.poi.hslf.record.ExObjList : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4103 on class class org.apache.poi.hslf.record.ExMCIMovie : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4101 on class class org.apache.poi.hslf.record.ExVideoContainer : java.lang.reflect.InvocationTargetException Cause was : java.lang.ArrayIndexOutOfBoundsException: 1 at 
org.apache.poi.hslf.record.Record.createRecordForType:185 at org.apache.poi.hslf.record.Record.findChildRecords:128 at org.apache.poi.hslf.record.Document.:133 at sun.reflect.GeneratedConstructorAccessor45.newInstance:-1 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance:-1 at java.lang.reflect.Constructor.newInstance:-1 at org.apache.poi.hslf.record.Record.createRecordForType:181 at org.apache.poi.hslf.record.Record.buildRecordAtOffset:103 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read:276 at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords:257 at
[jira] [Updated] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2216:
---------------------------------
    Attachment: TB Coord RFCb.doc

> ArrayIndexOutOfBoundsException on a valid Word file
> ---------------------------------------------------
>
> Key: TIKA-2216
> URL: https://issues.apache.org/jira/browse/TIKA-2216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: TB Coord RFCb.doc
>
> java.lang.ArrayIndexOutOfBoundsException:
>   at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
>   at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
>   at org.apache.poi.hwpf.HWPFOldDocument.:132
>   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
>   at org.apache.tika.parser.microsoft.WordExtractor.parse:153
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2216) ArrayIndexOutOfBoundsException on a valid Word file
Seva Alekseyev created TIKA-2216:
------------------------------------

Summary: ArrayIndexOutOfBoundsException on a valid Word file
Key: TIKA-2216
URL: https://issues.apache.org/jira/browse/TIKA-2216
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

java.lang.ArrayIndexOutOfBoundsException:
  at org.apache.poi.hwpf.sprm.SprmBuffer.append:128
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:269
  at org.apache.poi.hwpf.model.PAPBinTable.rebuild:101
  at org.apache.poi.hwpf.HWPFOldDocument.:132
  at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
  at org.apache.tika.parser.microsoft.WordExtractor.parse:153
  at org.apache.tika.parser.microsoft.OfficeParser.parse:169
  at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Created] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file
Seva Alekseyev created TIKA-2215:

Summary: TikaException about "Invalid embedded resource" on a valid PPT file
Key: TIKA-2215
URL: https://issues.apache.org/jira/browse/TIKA-2215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: Iverson.ppt

On the attached file, which opens with PowerPoint, the Tika parser throws the following error:

org.apache.tika.exception.TikaException: Invalid embedded resource
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:243
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Block 32630271 not found
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:486
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
    at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 16706699264 in stream of length 164352
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read:42
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt:484
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:169
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next:142
    at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully:248
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:165
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readFully:160
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc:226
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources:390
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse:142
    at org.apache.tika.parser.microsoft.OfficeParser.parse:172
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
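The numbers in the two nested exceptions of TIKA-2215 agree with each other, which may help when reading the trace. The following is a minimal sketch, assuming the usual OLE2 layout of one 512-byte header followed by 512-byte blocks; the class and method names are illustrative and are not POI API.

```java
public class BlockOffsetCheck {
    // Byte offset of a block, assuming a single header block precedes block 0.
    static long blockOffset(long blockIndex, int blockSize) {
        return (blockIndex + 1) * blockSize;
    }

    public static void main(String[] args) {
        int blockSize = 512;              // from "Unable to read 512 bytes"
        long requestedBlock = 32630271L;  // from "Block 32630271 not found"
        long sourceLength = 164352L;      // from "stream of length 164352"

        long offset = blockOffset(requestedBlock, blockSize);
        // Whole blocks in the data source, minus the assumed header block.
        long dataBlocks = sourceLength / blockSize - 1;

        System.out.println("offset of requested block: " + offset);
        System.out.println("data blocks present: " + dataBlocks);
        System.out.println("block fits in source: " + (offset + blockSize <= sourceLength));
    }
}
```

Under that assumption the computed offset is 16706699264, the same value reported in the innermost exception, while a 164352-byte source holds only 320 data blocks. That suggests the block index itself is out of range (a bad or misread block pointer) rather than a short read at a valid position.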
[jira] [Updated] (TIKA-2215) TikaException about "Invalid embedded resource" on a valid PPT file
[ https://issues.apache.org/jira/browse/TIKA-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2215:

Attachment: Iverson.ppt
[jira] [Created] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file
Seva Alekseyev created TIKA-2214:

Summary: ArrayIndexOutOfBoundsException on a valid Word file
Key: TIKA-2214
URL: https://issues.apache.org/jira/browse/TIKA-2214
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: NONCONT.DOC

On the attached file, which opens with Word, the Tika parser throws the following error:

java.lang.ArrayIndexOutOfBoundsException:
    at java.lang.System.arraycopy:-2
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl:171
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>:101
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>:49
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:105
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Updated] (TIKA-2214) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2214:

Attachment: NONCONT.DOC
[jira] [Updated] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2213:

Attachment: biennial - 96.doc
[jira] [Created] (TIKA-2213) ArrayIndexOutOfBoundsException on a valid Word file
Seva Alekseyev created TIKA-2213:

Summary: ArrayIndexOutOfBoundsException on a valid Word file
Key: TIKA-2213
URL: https://issues.apache.org/jira/browse/TIKA-2213
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev

On the attached file, which opens in Word, Tika parser throws the following error:

java.lang.ArrayIndexOutOfBoundsException:
    at java.lang.System.arraycopy:-2
    at org.apache.poi.hwpf.model.TextPieceTable.<init>:109
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>:70
    at org.apache.poi.hwpf.HWPFOldDocument.<init>:68
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6:642
    at org.apache.tika.parser.microsoft.WordExtractor.parse:153
    at org.apache.tika.parser.microsoft.OfficeParser.parse:169
    at org.apache.tika.parser.microsoft.OfficeParser.parse:130
[jira] [Resolved] (TIKA-2094) Error parsing .doc file with visio embed
[ https://issues.apache.org/jira/browse/TIKA-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2094.

Resolution: Not A Problem

Not a problem anymore

> Error parsing .doc file with visio embed
>
> Key: TIKA-2094
> URL: https://issues.apache.org/jira/browse/TIKA-2094
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Environment: JDK7
> Reporter: wangruochan
> Attachments: testtika.doc, testtika.doc
>
> When I try to parse a .doc file with an embedded Visio drawing, an exception occurs. The stack trace is below:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
>     at com.microsoft.schemas.office.visio.x2012.main.impl.PageContentsTypeImpl.getConnects(Unknown Source)
>     at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:89)
>     at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:73)
>     at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:94)
>     at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:108)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>     at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
>     at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:164)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:208)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at test.apache.tika.Test.main(Test.java:29)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.visio.x2012.main.ConnectsType
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     ... 30 more
[jira] [Resolved] (TIKA-2107) Old MS Word files give error while indexing
[ https://issues.apache.org/jira/browse/TIKA-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2107. --- Resolution: Won't Fix Sorry, POI/Tika don't currently support such old Word files. If anyone contributes a parser to POI, we'll be sure to include it in Tika. Closing as 'won't fix' for now. Please reopen if a Java, Apache 2.0-compatible parser is available. > Old MS Word files give error while indexing > --- > > Key: TIKA-2107 > URL: https://issues.apache.org/jira/browse/TIKA-2107 > Project: Tika > Issue Type: Bug > Components: tika-batch >Affects Versions: 1.8, 2.0 > Environment: ubuntu >Reporter: Gaurav > Labels: patch > Attachments: Tika 2.0 error.jpg, plen281.doc > > > error while indexing old MS word files > Screen shot of Tika 2.0 attached. > Error with Tika 1.8: > Log of Tika 1.8: > INFO: meta (application/msword) > Oct 04, 2016 6:42:30 PM org.apache.tika.server.resource.TikaResource parse > WARNING: meta: Text extraction failed > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from > org.apache.tika.parser.microsoft.OfficeParser@7260e439 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:238) > at > org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:134) > at > org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:67) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181) > at > org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > at > org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) > at > org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251) > at > org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) > at > org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865) > at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) > at > org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > Caused by:
[jira] [Updated] (TIKA-2212) Update mimes for OOXMLParser
[ https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2212: -- Description: On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in our OOXMLParser. Let's add it. I also found that it was not possible to exclude children or grandchildren of "x-tika-ooxml". We should fix that somehow. was:On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in our OOXMLParser. Let's add it. > Update mimes for OOXMLParser > > > Key: TIKA-2212 > URL: https://issues.apache.org/jira/browse/TIKA-2212 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files > in our OOXMLParser. Let's add it. > I also found that it was not possible to exclude children or grandchildren of > "x-tika-ooxml". We should fix that somehow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2212) Update mimes for OOXMLParser
[ https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761132#comment-15761132 ]

Tim Allison commented on TIKA-2212:

So that users can control includes/excludes with greater precision, I propose adding:
{noformat}
application/vnd.ms-powerpoint.slide.macroenabled.12
application/vnd.ms-powerpoint.template.macroenabled.12
application/vnd.openxmlformats-officedocument.presentationml.slide
application/vnd.ms-visio.drawing
application/vnd.ms-visio.drawing.macroenabled.12
application/vnd.ms-visio.stencil
application/vnd.ms-visio.stencil.macroenabled.12
application/vnd.ms-visio.template
application/vnd.ms-visio.template.macroenabled.12
model/vnd.dwfx+xps
{noformat}
and removing {{x-tika-ooxml}} from OOXMLParser's {{SUPPORTED_TYPES}}.

Some questions:
1) Does the OOXMLParser actually support all of these or should some be moved to the {{UNSUPPORTED_OOXML_TYPES}}?
2) Any objections to this proposal? (ping [~gagravarr])

> Update mimes for OOXMLParser
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in our OOXMLParser. Let's add it.
> I also found that it was not possible to exclude children or grandchildren of "x-tika-ooxml". We should fix that somehow.
[jira] [Comment Edited] (TIKA-2212) Update mimes for OOXMLParser
[ https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761116#comment-15761116 ]

Tim Allison edited comment on TIKA-2212 at 12/19/16 1:15 PM:

If we run this:
{noformat}
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
MediaTypeRegistry registry = tikaConfig.getMediaTypeRegistry();
for (MediaType child : registry.getChildTypes(MediaType.application("x-tika-ooxml"))) {
    //System.out.println(child);
    if (! OOXMLParser.SUPPORTED_TYPES.contains(child)
            && ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
        System.out.println("Falling between the cracks: " + child);
    }
    for (MediaType grandchild : registry.getChildTypes(child)) {
        if (! OOXMLParser.SUPPORTED_TYPES.contains(grandchild)
                && ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(grandchild)) {
            System.out.println("Falling between the cracks grandchild: " + grandchild);
        }
    }
}
{noformat}
We get this:
{noformat}
Falling between the cracks: application/vnd.ms-powerpoint.slide.macroenabled.12
Falling between the cracks: application/vnd.ms-powerpoint.template.macroenabled.12
Falling between the cracks: application/vnd.openxmlformats-officedocument.presentationml.slide
Falling between the cracks: application/x-tika-ooxml-protected
Falling between the cracks: application/x-tika-visio-ooxml
Falling between the cracks grandchild: application/vnd.ms-visio.drawing
Falling between the cracks grandchild: application/vnd.ms-visio.drawing.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.stencil
Falling between the cracks grandchild: application/vnd.ms-visio.stencil.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.template
Falling between the cracks grandchild: application/vnd.ms-visio.template.macroenabled.12
Falling between the cracks: model/vnd.dwfx+xps
{noformat}

was (Author: talli...@mitre.org):

If we run this:
{noformat}
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
MediaTypeRegistry registry = tikaConfig.getMediaTypeRegistry();
for (MediaType child : registry.getChildTypes(MediaType.application("x-tika-ooxml"))) {
    //System.out.println(child);
    if (! OOXMLParser.SUPPORTED_TYPES.contains(child)
            && ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
        System.out.println("Falling between the cracks: "+child);
    }
    for (MediaType grandchild : registry.getChildTypes(child)) {
        if (! OOXMLParser.SUPPORTED_TYPES.contains(child)
                && ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) {
            System.out.println("Falling between the cracks grandchild: "+grandchild);
        }
    }
}
{noformat}
We get this:
{noformat}
Falling between the cracks: application/vnd.ms-powerpoint.slide.macroenabled.12
Falling between the cracks: application/vnd.ms-powerpoint.template.macroenabled.12
Falling between the cracks: application/vnd.openxmlformats-officedocument.presentationml.slide
Falling between the cracks: application/x-tika-ooxml-protected
Falling between the cracks: application/x-tika-visio-ooxml
Falling between the cracks grandchild: application/vnd.ms-visio.drawing
Falling between the cracks grandchild: application/vnd.ms-visio.drawing.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.stencil
Falling between the cracks grandchild: application/vnd.ms-visio.stencil.macroenabled.12
Falling between the cracks grandchild: application/vnd.ms-visio.template
Falling between the cracks grandchild: application/vnd.ms-visio.template.macroenabled.12
Falling between the cracks: model/vnd.dwfx+xps
{noformat}

> Update mimes for OOXMLParser
>
> Key: TIKA-2212
> URL: https://issues.apache.org/jira/browse/TIKA-2212
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Trivial
>
> On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in our OOXMLParser. Let's add it.
> I also found that it was not possible to exclude children or grandchildren of "x-tika-ooxml". We should fix that somehow.
[jira] [Updated] (TIKA-2212) Update mimes for OOXMLParser
[ https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2212: -- Summary: Update mimes for OOXMLParser (was: Add mime for .potm to OOXMLParser) > Update mimes for OOXMLParser > > > Key: TIKA-2212 > URL: https://issues.apache.org/jira/browse/TIKA-2212 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Trivial > > On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files > in our OOXMLParser. Let's add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2212) Add mime for .potm to OOXMLParser
[ https://issues.apache.org/jira/browse/TIKA-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761116#comment-15761116 ] Tim Allison commented on TIKA-2212: --- If we run this: {noformat} TikaConfig tikaConfig = TikaConfig.getDefaultConfig(); MediaTypeRegistry registry = tikaConfig.getMediaTypeRegistry(); for (MediaType child : registry.getChildTypes(MediaType.application("x-tika-ooxml"))) { //System.out.println(child); if (! OOXMLParser.SUPPORTED_TYPES.contains(child) && ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) { System.out.println("Falling between the cracks: "+child); } for (MediaType grandchild : registry.getChildTypes(child)) { if (! OOXMLParser.SUPPORTED_TYPES.contains(child) && ! OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(child)) { System.out.println("Falling between the cracks grandchild: "+grandchild); } } } {noformat} We get this: {noformat} Falling between the cracks: application/vnd.ms-powerpoint.slide.macroenabled.12 Falling between the cracks: application/vnd.ms-powerpoint.template.macroenabled.12 Falling between the cracks: application/vnd.openxmlformats-officedocument.presentationml.slide Falling between the cracks: application/x-tika-ooxml-protected Falling between the cracks: application/x-tika-visio-ooxml Falling between the cracks grandchild: application/vnd.ms-visio.drawing Falling between the cracks grandchild: application/vnd.ms-visio.drawing.macroenabled.12 Falling between the cracks grandchild: application/vnd.ms-visio.stencil Falling between the cracks grandchild: application/vnd.ms-visio.stencil.macroenabled.12 Falling between the cracks grandchild: application/vnd.ms-visio.template Falling between the cracks grandchild: application/vnd.ms-visio.template.macroenabled.12 Falling between the cracks: model/vnd.dwfx+xps {noformat} > Add mime for .potm to OOXMLParser > - > > Key: TIKA-2212 > URL: https://issues.apache.org/jira/browse/TIKA-2212 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison 
>Priority: Trivial > > On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files > in our OOXMLParser. Let's add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2212) Add mime for .potm to OOXMLParser
Tim Allison created TIKA-2212: - Summary: Add mime for .potm to OOXMLParser Key: TIKA-2212 URL: https://issues.apache.org/jira/browse/TIKA-2212 Project: Tika Issue Type: Bug Reporter: Tim Allison Priority: Trivial On TIKA-2208, [~dadoonet] found that we are missing the mime for .potm files in our OOXMLParser. Let's add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-2208) Catch missing libraires
[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761055#comment-15761055 ]

Tim Allison edited comment on TIKA-2208 at 12/19/16 12:54 PM:

Three cheers for unit tests! It looks like we need to add vnd.ms-powerpoint.template.macroenabled.12 to OOXMLParser's handled media types. I'll make that change shortly.

Meanwhile, you could try something like this, which runs against nearly all of our test documents:
{noformat}
private static final Set<MediaType> INCLUDES = new HashSet<>();
static {
    for (MediaType mediaType : OOXMLParser.SUPPORTED_TYPES) {
        if (mediaType.equals(MediaType.application("x-tika-ooxml"))) {
            continue;
        }
        INCLUDES.add(mediaType);
    }
    INCLUDES.add(MediaType.application("vnd.ms-powerpoint.template.macroenabled.12"));
}

private static final Set<MediaType> EXCLUDES = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
        MediaType.application("x-tika-ooxml")
)));

private static final Parser DECORATED_PARSERS[] = new Parser[] {
    // documents
    new org.apache.tika.parser.html.HtmlParser(),
    new org.apache.tika.parser.rtf.RTFParser(),
    new org.apache.tika.parser.pdf.PDFParser(),
    new org.apache.tika.parser.txt.TXTParser(),
    new org.apache.tika.parser.microsoft.OfficeParser(),
    new org.apache.tika.parser.microsoft.OldExcelParser(),
    ParserDecorator.withTypes(
            ParserDecorator.withoutTypes(
                    new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES
            ), INCLUDES),
    new org.apache.tika.parser.odf.OpenDocumentParser(),
    new org.apache.tika.parser.iwork.IWorkPackageParser(),
    new org.apache.tika.parser.xml.DcXMLParser(),
    new org.apache.tika.parser.epub.EpubParser(),
};

private static final Parser STANDARD_PARSERS[] = new Parser[] {
    // documents
    new org.apache.tika.parser.html.HtmlParser(),
    new org.apache.tika.parser.rtf.RTFParser(),
    new org.apache.tika.parser.pdf.PDFParser(),
    new org.apache.tika.parser.txt.TXTParser(),
    new org.apache.tika.parser.microsoft.OfficeParser(),
    new org.apache.tika.parser.microsoft.OldExcelParser(),
    new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
    new org.apache.tika.parser.odf.OpenDocumentParser(),
    new org.apache.tika.parser.iwork.IWorkPackageParser(),
    new org.apache.tika.parser.xml.DcXMLParser(),
    new org.apache.tika.parser.epub.EpubParser(),
};

private static final AutoDetectParser DECORATED_PARSER_INSTANCE = new AutoDetectParser(DECORATED_PARSERS);
private static final AutoDetectParser STANDARD_PARSER_INSTANCE = new AutoDetectParser(STANDARD_PARSERS);

private static final Tika DECORATED_TIKA = new Tika(DECORATED_PARSER_INSTANCE.getDetector(), DECORATED_PARSER_INSTANCE);
private static final Tika STANDARD_TIKA = new Tika(STANDARD_PARSER_INSTANCE.getDetector(), STANDARD_PARSER_INSTANCE);

@Test
public void testSkipVisioOOXML() throws Exception {
    for (File f : getResourceAsFile("/test-documents").listFiles()) {
        if (f.isDirectory()) {
            continue;
        }
        if (f.getName().contains("VISIO") && (f.getName().endsWith("x") || f.getName().endsWith("m"))) {
            continue;
        }
        if (f.getName().contains("embeddedVsdx")) {
            continue;
        }
        boolean decoratedEx = false;
        boolean standardEx = false;
        String decoratedOutput = "";
        String standardOutput = "";
        try (InputStream is = TikaInputStream.get(f)) {
            decoratedOutput = DECORATED_TIKA.parseToString(is);
        } catch (Throwable e) {
            decoratedEx = true;
        }
        try (InputStream is = TikaInputStream.get(f)) {
            standardOutput = STANDARD_TIKA.parseToString(is);
        } catch (Throwable e) {
            standardEx = true;
        }
        assertEquals(f.getName(), standardEx, decoratedEx);
        if (standardEx == false) {
            assertEquals(f.getName(), standardOutput, decoratedOutput);
        }
    }
}
{noformat}

was (Author: talli...@mitre.org):

Three cheers for unit tests! It looks like we need to add vnd.ms-powerpoint.template.macroenabled.12 to OOXMLParser's handled media types. I'll make that change shortly.
Meanwhile, you could try something like this, which runs against nearly all of our test documents: {noformat} private static final Set INCLUDES = new HashSet<>(); static {
[jira] [Commented] (TIKA-2208) Catch missing libraires
[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761085#comment-15761085 ] Tim Allison commented on TIKA-2208: --- Ugh. Rather than including the two clashing subsets of the ooxml-schemas, you could include the full [ooxml-schemas|https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas/1.3]. That does weigh in at 15MB, but it includes everything. > Catch missing libraires > --- > > Key: TIKA-2208 > URL: https://issues.apache.org/jira/browse/TIKA-2208 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: David Pilato > > Hi there > We have decided to remove support for some formats when using Tika to extract > text and metadata. > We defined our list of Parsers: > {code:java} > private static final Parser PARSERS[] = new Parser[] { > // documents > new org.apache.tika.parser.html.HtmlParser(), > new org.apache.tika.parser.rtf.RTFParser(), > new org.apache.tika.parser.pdf.PDFParser(), > new org.apache.tika.parser.txt.TXTParser(), > new org.apache.tika.parser.microsoft.OfficeParser(), > new org.apache.tika.parser.microsoft.OldExcelParser(), > new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), > new org.apache.tika.parser.odf.OpenDocumentParser(), > new org.apache.tika.parser.iwork.IWorkPackageParser(), > new org.apache.tika.parser.xml.DcXMLParser(), > new org.apache.tika.parser.epub.EpubParser(), > }; > private static final AutoDetectParser PARSER_INSTANCE = new > AutoDetectParser(PARSERS); > private static final Tika TIKA_INSTANCE = new > Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE); > {code} > But when a MS Office Word document embeds another non supported document > (Like a Visio Schema) an {{NoClassDefFoundError}} is raised. > Would it be possible to catch such a case and throw in that case a > {{TikaException}} so it behaves as an Exception and not as a Throwable? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
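Until something like the requested behaviour exists in Tika itself, a caller could approximate it by wrapping the parse call and converting linkage failures into a checked exception. The following is a minimal sketch and not Tika code: `SafeExtract`, `run`, and `ExtractionException` are hypothetical names, with `ExtractionException` standing in for `TikaException` so that the example needs only the JDK.

```java
import java.util.concurrent.Callable;

public class SafeExtract {
    /** Hypothetical checked exception standing in for TikaException. */
    public static class ExtractionException extends Exception {
        public ExtractionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the given extraction task and rethrows LinkageError subclasses
     * (NoClassDefFoundError among them) as a checked ExtractionException.
     */
    public static <T> T run(Callable<T> task) throws ExtractionException {
        try {
            return task.call();
        } catch (LinkageError e) {
            // Typically a parser dependency is missing from the classpath.
            throw new ExtractionException("Missing or incompatible library: " + e.getMessage(), e);
        } catch (ExtractionException e) {
            throw e;
        } catch (Exception e) {
            throw new ExtractionException("Extraction failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulates a parser that hits a class missing at runtime.
            run(() -> { throw new NoClassDefFoundError("com/example/Missing"); });
        } catch (ExtractionException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

In real use the lambda would presumably wrap a call such as `TIKA_INSTANCE.parseToString(stream)`. Catching `LinkageError` rather than `Throwable` is a deliberate choice in this sketch: it covers `NoClassDefFoundError` while leaving errors such as `OutOfMemoryError` to propagate unchanged.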
[jira] [Commented] (TIKA-2208) Catch missing libraires
[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761071#comment-15761071 ]

Tim Allison commented on TIKA-2208:
---

Note, too, that the test also passes if you add:
{noformat}
if (f.getName().equals("testPPT.potm")) {
    assertContains("Watershed", decoratedOutput);
}
{noformat}
:)
[jira] [Comment Edited] (TIKA-2208) Catch missing libraires
[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761055#comment-15761055 ]

Tim Allison edited comment on TIKA-2208 at 12/19/16 12:40 PM:
--

Three cheers for unit tests! It looks like we need to add vnd.ms-powerpoint.template.macroenabled.12 to OOXMLParser's handled media types. I'll make that change shortly.

Meanwhile, you could try something like this, which runs against nearly all of our test documents:
{noformat}
// Types the decorated OOXMLParser should handle: everything it supports
// except the generic x-tika-ooxml type, plus the macro-enabled template type.
private static final Set<MediaType> INCLUDES = new HashSet<>();
static {
    for (MediaType mediaType : OOXMLParser.SUPPORTED_TYPES) {
        if (mediaType.equals(MediaType.application("x-tika-ooxml"))) {
            continue;
        }
        INCLUDES.add(mediaType);
    }
    INCLUDES.add(MediaType.application("vnd.ms-powerpoint.template.macroenabled.12"));
}

private static final Set<MediaType> EXCLUDES = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
        MediaType.application("x-tika-ooxml")
)));

private static final Parser[] DECORATED_PARSERS = new Parser[] {
    // documents
    new org.apache.tika.parser.html.HtmlParser(),
    new org.apache.tika.parser.rtf.RTFParser(),
    new org.apache.tika.parser.pdf.PDFParser(),
    new org.apache.tika.parser.txt.TXTParser(),
    new org.apache.tika.parser.microsoft.OfficeParser(),
    new org.apache.tika.parser.microsoft.OldExcelParser(),
    ParserDecorator.withTypes(ParserDecorator.withoutTypes(
            new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES
    ), INCLUDES),
    new org.apache.tika.parser.odf.OpenDocumentParser(),
    new org.apache.tika.parser.iwork.IWorkPackageParser(),
    new org.apache.tika.parser.xml.DcXMLParser(),
    new org.apache.tika.parser.epub.EpubParser(),
};

private static final Parser[] STANDARD_PARSERS = new Parser[] {
    // documents
    new org.apache.tika.parser.html.HtmlParser(),
    new org.apache.tika.parser.rtf.RTFParser(),
    new org.apache.tika.parser.pdf.PDFParser(),
    new org.apache.tika.parser.txt.TXTParser(),
    new org.apache.tika.parser.microsoft.OfficeParser(),
    new org.apache.tika.parser.microsoft.OldExcelParser(),
    new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
    new org.apache.tika.parser.odf.OpenDocumentParser(),
    new org.apache.tika.parser.iwork.IWorkPackageParser(),
    new org.apache.tika.parser.xml.DcXMLParser(),
    new org.apache.tika.parser.epub.EpubParser(),
};

private static final AutoDetectParser DECORATED_PARSER_INSTANCE = new AutoDetectParser(DECORATED_PARSERS);
private static final AutoDetectParser STANDARD_PARSER_INSTANCE = new AutoDetectParser(STANDARD_PARSERS);
private static final Tika DECORATED_TIKA = new Tika(DECORATED_PARSER_INSTANCE.getDetector(), DECORATED_PARSER_INSTANCE);
private static final Tika STANDARD_TIKA = new Tika(STANDARD_PARSER_INSTANCE.getDetector(), STANDARD_PARSER_INSTANCE);

@Test
public void testSkipVisioOOXML() throws Exception {
    for (File f : getResourceAsFile("/test-documents").listFiles()) {
        if (f.isDirectory()) {
            continue;
        }
        // Skip the Visio OOXML files and the document that embeds one.
        if (f.getName().contains("VISIO") && (f.getName().endsWith("x") || f.getName().endsWith("m"))) {
            continue;
        }
        if (f.getName().contains("embeddedVsdx")) {
            continue;
        }
        boolean decoratedEx = false;
        boolean standardEx = false;
        String decoratedOutput = "";
        String standardOutput = "";
        try (InputStream is = TikaInputStream.get(f)) {
            decoratedOutput = DECORATED_TIKA.parseToString(is);
        } catch (Throwable e) {
            decoratedEx = true;
        }
        try (InputStream is = TikaInputStream.get(f)) {
            standardOutput = STANDARD_TIKA.parseToString(is);
        } catch (Throwable e) {
            standardEx = true;
        }
        // Both configurations must fail on the same files and otherwise agree on the text.
        assertEquals(f.getName(), standardEx, decoratedEx);
        if (!standardEx) {
            assertEquals(f.getName(), standardOutput, decoratedOutput);
        }
    }
}
{noformat}

was (Author: talli...@mitre.org):
Three cheers for unit tests! It looks like we need to add vnd.ms-powerpoint.template.macroenabled.12 to OOXMLParser's handled media types. I'll make that change shortly.

Meanwhile, you could try something like this, which runs against nearly all of our test documents:

private static final Set INCLUDES = new HashSet<>();
static {
for
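The include/exclude bookkeeping in the comment above boils down to plain set arithmetic: start from the supported types, drop the excluded ones, then add the extras. The sketch below shows only that step, using strings in place of {{MediaType}} so it runs with the JDK alone. {{HandledTypes}} and {{handledTypes}} are hypothetical names, and the media type strings in the usage note are examples, not a list taken from Tika.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: computes the set of types a
// decorated parser should claim. Strings stand in for MediaType here.
final class HandledTypes {

    static Set<String> handledTypes(Set<String> supported, Set<String> excluded, Set<String> extra) {
        Set<String> result = new TreeSet<>(supported); // copy, so the input is left untouched
        result.removeAll(excluded);                    // drop the types to hand off elsewhere
        result.addAll(extra);                          // add types missing from the supported list
        return Collections.unmodifiableSet(result);
    }
}
```

For instance, passing a supported set that contains {{application/x-tika-ooxml}}, excluding that one type, and adding {{application/vnd.ms-powerpoint.template.macroenabled.12}} as an extra yields a set with the template type and without the generic one.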
[jira] [Commented] (TIKA-2211) ePub formatting instructions appear in plain text output
[ https://issues.apache.org/jira/browse/TIKA-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15760978#comment-15760978 ]

Tim Allison commented on TIKA-2211:
---

The ePub parser is using a straight SAXParser with no modifications. Looks like we should modify it slightly to ignore <style> sections?
{noformat}
/**/
{noformat}

> ePub formatting instructions appear in plain text output
>
>
> Key: TIKA-2211
> URL: https://issues.apache.org/jira/browse/TIKA-2211
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Environment: I tested this on Mac OSX 10.11.6 with Oracle JDK
> 1.8.0_112. The Tika stand-alone application was launched as follows:
> {code}
> java -jar tika-app-1.14.jar
> {code}
> Reporter: Adam Carroll
>
> For some ePub files, format information appears in the plain text output
> produced by Apache Tika. For example, the Tika stand-alone application shows
> the following text for the file “Don Quijote de la Mancha - Miguel de
> Cervantes.epub” (downloaded
> [here|http://www.literanda.com/don-quijote-de-la-mancha--miguel-de-cervantes--epub]):
> {code}
> /**/
> p.sgc-2 {font-style: italic; text-align: right}
> p.sgc-1 {text-align: justify;}
> h3.sgc-3 {text-align: center;}
> /**/
> Al duque de Béjar
> Marqués de Gibraleón, conde de Benalcázar y Bañares, vizconde de La Puebla de
> Alcocer, señor de las villas de Capilla, Curiel y Burguillos
> En fe del buen acogimiento y honra que hace Vuestra Excelencia a toda suerte
> de libros, como príncipe tan inclinado a favorecer las buenas artes,
> mayormente las que por su nobleza no se abaten al servicio y granjerías del
> vulgo, he determinado de sacar a luz El ingenioso hidalgo don Quijote de la
> Mancha, al abrigo del clarísimo nombre de Vuestra Excelencia, a quien, con el
> acatamiento que debo a tanta grandeza, suplico le reciba agradablemente en su
> protección, para que a su sombra, aunque desnudo de aquel precioso ornamento
> de elegancia y erudición de que suelen andar vestidas las obras que se
> componen en las casas de los hombres que saben, ose parecer seguramente en el
> juicio de algunos que, conteniéndose en los límites de su ignorancia, suelen
> condenar con más rigor y menos justicia los trabajos ajenos; que, poniendo
> los ojos la prudencia de Vuestra Excelencia en mi buen deseo, fío que no
> desdeñará la cortedad de tan humilde servicio.
> {code}
> To reproduce this problem, run the stand-alone version of Tika and open an
> affected ePub file such as the one mentioned above. Then go to View -> Plain
> Text. You should see the problem there.
> By the way, thanks for making Apache Tika a really useful library. Keep up
> the good work!
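One way to read the suggestion in the comment above is a SAX handler that suppresses character data while inside a style element. The sketch below uses only the JDK's SAX API. It is an illustration of the idea, not the change that went into Tika's EpubParser, and {{StyleSkippingText}} and {{extractText}} are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code: collects text from XHTML while
// ignoring everything inside <style> elements.
final class StyleSkippingText {

    private static final class Handler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int styleDepth = 0; // greater than 0 while inside a <style> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (isStyle(localName, qName)) {
                styleDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isStyle(localName, qName)) {
                styleDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // CDATA sections are also delivered here, so CSS wrapped in CDATA is skipped too.
            if (styleDepth == 0) {
                text.append(ch, start, length);
            }
        }

        private static boolean isStyle(String localName, String qName) {
            return "style".equalsIgnoreCase(localName) || "style".equalsIgnoreCase(qName);
        }
    }

    static String extractText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        Handler handler = new Handler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }
}
```

With a depth counter rather than a boolean, the handler stays correct even if style elements were ever nested. Input that carries a DOCTYPE pointing at an external DTD would need extra parser configuration that this sketch does not cover.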