[jira] [Issue Comment Deleted] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2146:
    Comment: was deleted

(was: Hi Tim,
Can you please remove the document Test.doc. Seems it contains sensitive data.
Thanks)

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.11
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: This is password protected.doc
>
> When I try to parse a protected MS Word document, I am unable to extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:537)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>     at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>     at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>     at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>     at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>     at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>     at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>     at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>     at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>     at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>     at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>     at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
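In the TIKA-2146 trace, the top-level TikaException only wraps the underlying failure, and the informative part is the "Caused by" section. When triaging such reports in bulk, it can help to extract the innermost exception programmatically. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and plain JDK exceptions stand in for the Tika and POI ones.

```java
class RootCauseDemo {
    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop if a cause points back at itself, which would otherwise loop forever.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the wrapper exception and the underlying parser failure.
        Exception inner = new ArrayIndexOutOfBoundsException();
        Exception outer = new RuntimeException("Unexpected RuntimeException from parser", inner);
        System.out.println(rootCause(outer).getClass().getName());
    }
}
```

Grouping failures by the root cause class and its first stack frame is one simple way to tell whether several reports share the same underlying bug.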
[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2146:
    Attachment: (was: Test bug.doc)

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.11
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: This is password protected.doc
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931048#comment-15931048 ]

Sharath Kumar commented on TIKA-2146:
-

Hi Tim,
Can you please remove the document Test.doc. Seems it contains sensitive data.
Thanks

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.11
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
[jira] [Updated] (TIKA-2285) Caused by: java.lang.StringIndexOutOfBoundsException - org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle
[ https://issues.apache.org/jira/browse/TIKA-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2285:
    Attachment: XAPPLICANT__2016.docx

> Caused by: java.lang.StringIndexOutOfBoundsException - org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle
> ---
>
> Key: TIKA-2285
> URL: https://issues.apache.org/jira/browse/TIKA-2285
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.13
> Reporter: Sharath Kumar
> Attachments: XAPPLICANT__2016.docx
>
> I get the error below when parsing a Word DOC:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
>     at java.lang.String.substring(String.java:1963)
>     at org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle(WordExtractor.java:126)
[jira] [Created] (TIKA-2285) Caused by: java.lang.StringIndexOutOfBoundsException - org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle
Sharath Kumar created TIKA-2285:
---

Summary: Caused by: java.lang.StringIndexOutOfBoundsException - org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle
Key: TIKA-2285
URL: https://issues.apache.org/jira/browse/TIKA-2285
Project: Tika
Issue Type: Bug
Components: core, parser
Affects Versions: 1.13
Reporter: Sharath Kumar

I get the error below when parsing a Word DOC:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
    at java.lang.String.substring(String.java:1963)
    at org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle(WordExtractor.java:126)
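The message "String index out of range: 1" in the TIKA-2285 trace is consistent with String.substring being asked for an end index of 1 on a string shorter than that, for example an empty style name. Whether that is what happens at WordExtractor.java:126 is an assumption; the report does not show the code. The sketch below only illustrates the failure mode and a defensive alternative, with a hypothetical helper name.

```java
class SafeSubstring {
    /** Returns at most the first n characters of s, never throwing for short or null input. */
    static String prefix(String s, int n) {
        if (s == null || n <= 0) {
            return "";
        }
        // Clamp the end index to the string length so substring cannot overrun.
        return s.substring(0, Math.min(n, s.length()));
    }

    public static void main(String[] args) {
        System.out.println("[" + prefix("", 1) + "]");
        System.out.println(prefix("Heading1", 1));
        try {
            "".substring(0, 1); // the unguarded call fails on an empty string
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded substring threw " + e.getClass().getSimpleName());
        }
    }
}
```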
[jira] [Commented] (TIKA-2284) Caused by: org.apache.xmlbeans.XmlException: error: The document is not a ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name m
[ https://issues.apache.org/jira/browse/TIKA-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889879#comment-15889879 ]

Sharath Kumar commented on TIKA-2284:
-

I am not able to add the attachment here because, after removing the confidential info from the doc, if I save it and try to parse it, I don't get the exception. Even if I modify it slightly and save the file, I cannot reproduce the issue. But as the original doc contains user details, I can't upload it here.

> Caused by: org.apache.xmlbeans.XmlException: error: The document is not a ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected ftr got hdr
> -
>
> Key: TIKA-2284
> URL: https://issues.apache.org/jira/browse/TIKA-2284
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.13
> Reporter: Sharath Kumar
>
> I get the parsing error below for the attached doc:
> Caused by: org.apache.xmlbeans.XmlException: error: The document is not a ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected ftr got hdr
>     at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType(Locale.java:459)
>     at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument(Locale.java:364)
>     at
[jira] [Created] (TIKA-2284) Caused by: org.apache.xmlbeans.XmlException: error: The document is not a ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mis
Sharath Kumar created TIKA-2284:
---

Summary: Caused by: org.apache.xmlbeans.XmlException: error: The document is not a ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected ftr got hdr
Key: TIKA-2284
URL: https://issues.apache.org/jira/browse/TIKA-2284
Project: Tika
Issue Type: Bug
Components: core, parser
Affects Versions: 1.13
Reporter: Sharath Kumar

I get the parsing error below for the attached doc:
Caused by: org.apache.xmlbeans.XmlException: error: The document is not a ftr@http://schemas.openxmlformats.org/wordprocessingml/2006/main: document element local name mismatch expected ftr got hdr
    at org.apache.xmlbeans.impl.store.Locale.verifyDocumentType(Locale.java:459)
    at org.apache.xmlbeans.impl.store.Locale.autoTypeDocument(Locale.java:364)
    at
[jira] [Updated] (TIKA-2283) Pap style 16 claimed to have itself as its parent, which isn't allowed
[ https://issues.apache.org/jira/browse/TIKA-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2283:
    Attachment: Test_doc.doc

> Pap style 16 claimed to have itself as its parent, which isn't allowed
> --
>
> Key: TIKA-2283
> URL: https://issues.apache.org/jira/browse/TIKA-2283
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.13
> Reporter: Sharath Kumar
> Attachments: Test_doc.doc
>
> For the attached document, I get the error below when parsing:
> Caused by: java.lang.IllegalStateException: Pap style 16 claimed to have itself as its parent, which isn't allowed
>     at org.apache.poi.hwpf.model.StyleSheet.createPap(StyleSheet.java:232)
>     at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:120)
[jira] [Created] (TIKA-2283) Pap style 16 claimed to have itself as its parent, which isn't allowed
Sharath Kumar created TIKA-2283:
---

Summary: Pap style 16 claimed to have itself as its parent, which isn't allowed
Key: TIKA-2283
URL: https://issues.apache.org/jira/browse/TIKA-2283
Project: Tika
Issue Type: Bug
Components: core, parser
Affects Versions: 1.13
Reporter: Sharath Kumar

For the attached document, I get the error below when parsing:
Caused by: java.lang.IllegalStateException: Pap style 16 claimed to have itself as its parent, which isn't allowed
    at org.apache.poi.hwpf.model.StyleSheet.createPap(StyleSheet.java:232)
    at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:120)
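The TIKA-2283 error describes a style that names itself as its own parent. The general condition behind such a check is a cycle in the parent links of a style table. The sketch below shows one way to detect it; it is not POI's implementation, and the array layout, the NO_PARENT sentinel, and all names are assumptions made for illustration.

```java
class StyleParentCheck {
    /** Sentinel meaning "no parent"; hypothetical, chosen for this sketch. */
    static final int NO_PARENT = -1;

    /**
     * Returns true if following parent links from 'start' revisits a style.
     * parentOf[i] holds the index of style i's parent, or NO_PARENT.
     */
    static boolean hasParentCycle(int[] parentOf, int start) {
        boolean[] seen = new boolean[parentOf.length];
        int current = start;
        while (current != NO_PARENT) {
            if (current < 0 || current >= parentOf.length) {
                return false; // dangling reference: the chain ends without a cycle
            }
            if (seen[current]) {
                return true; // visited before, so the chain loops
            }
            seen[current] = true;
            current = parentOf[current];
        }
        return false;
    }

    public static void main(String[] args) {
        // Style 2 lists itself as its parent, the situation the error message describes.
        int[] parentOf = {NO_PARENT, 0, 2};
        System.out.println(hasParentCycle(parentOf, 2));
        System.out.println(hasParentCycle(parentOf, 1));
    }
}
```

A parser that wants to tolerate such files could treat a style caught by this check as having no parent rather than aborting, at the cost of possibly losing inherited formatting.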
[jira] [Commented] (TIKA-2258) Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
[ https://issues.apache.org/jira/browse/TIKA-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851037#comment-15851037 ]

Sharath Kumar commented on TIKA-2258:
-

Thanks Tim.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60685

> Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
>
> Key: TIKA-2258
> URL: https://issues.apache.org/jira/browse/TIKA-2258
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.13
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: Roc.pub
>
> When I try to parse the attached .pub file, it fails with the exception below:
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 88
>     at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:343)
>     at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:215)
>     at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:176)
>     at org.apache.poi.hpbf.model.qcbits.QCPLCBit.createQCPLCBit(QCPLCBit.java:90)
>     at org.apache.poi.hpbf.model.QuillContents.<init>(QuillContents.java:71)
>     at org.apache.poi.hpbf.HPBFDocument.<init>(HPBFDocument.java:67)
>     at org.apache.poi.hpbf.extractor.PublisherTextExtractor.<init>(PublisherTextExtractor.java:45)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:141)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 28 more
[jira] [Updated] (TIKA-2258) Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
[ https://issues.apache.org/jira/browse/TIKA-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2258:
    Attachment: Roc.pub

Test document which can be used to replicate the error

> Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
>
> Key: TIKA-2258
> URL: https://issues.apache.org/jira/browse/TIKA-2258
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.13
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: Roc.pub
[jira] [Created] (TIKA-2258) Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
Sharath Kumar created TIKA-2258:
---

Summary: Unable to parse .pub files -java.lang.ArrayIndexOutOfBoundsException: 88
Key: TIKA-2258
URL: https://issues.apache.org/jira/browse/TIKA-2258
Project: Tika
Issue Type: Bug
Components: core, parser
Affects Versions: 1.13
Environment: Windows 7
Reporter: Sharath Kumar

When I try to parse the attached .pub file, it fails with the exception below:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 88
    at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:343)
    at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:215)
    at org.apache.poi.hpbf.model.qcbits.QCPLCBit$Type12.<init>(QCPLCBit.java:176)
    at org.apache.poi.hpbf.model.qcbits.QCPLCBit.createQCPLCBit(QCPLCBit.java:90)
    at org.apache.poi.hpbf.model.QuillContents.<init>(QuillContents.java:71)
    at org.apache.poi.hpbf.HPBFDocument.<init>(HPBFDocument.java:67)
    at org.apache.poi.hpbf.extractor.PublisherTextExtractor.<init>(PublisherTextExtractor.java:45)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:141)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 28 more
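The TIKA-2258 failure surfaces while reading an unsigned 16-bit little-endian value, and the reported index 88 suggests the read ran past the end of the buffer. A reader of untrusted binary data can check the bounds before touching the array. The sketch below is a JDK-only illustration with hypothetical names; it is not the POI LittleEndian API and not the fix applied upstream.

```java
import java.util.OptionalInt;

class CheckedLittleEndian {
    /**
     * Reads an unsigned 16-bit little-endian value at 'offset',
     * or returns empty if two bytes are not available there.
     */
    static OptionalInt readUShort(byte[] data, int offset) {
        if (data == null || offset < 0 || offset > data.length - 2) {
            return OptionalInt.empty();
        }
        int low = data[offset] & 0xFF;      // least significant byte comes first
        int high = data[offset + 1] & 0xFF; // most significant byte comes second
        return OptionalInt.of((high << 8) | low);
    }

    public static void main(String[] args) {
        byte[] sample = {0x34, 0x12};
        System.out.println(readUShort(sample, 0));
        // An 88-byte buffer has no byte at index 88, so the read is rejected.
        System.out.println(readUShort(new byte[88], 88).isPresent());
    }
}
```

Returning an empty result lets the caller decide whether to skip the damaged record or to fail with a message that names the offending offset.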
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15635502#comment-15635502 ]

Sharath Kumar commented on TIKA-2146:
-

What is the action plan for this? Is this going to be supported in Tika or not?

> Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.11
> Environment: Windows 7
> Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2147) ClassCastException on a valid Word template
[ https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629203#comment-15629203 ]

Sharath Kumar commented on TIKA-2147:
-

Thanks [~talli...@mitre.org]

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx, basicresume.docx
>
> On the attached document template, which opens fine in Word, the Tika parser throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>     at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>     at org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>     at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>     at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
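The ClassCastException in the TIKA-2147 trace arises when a part's parent is cast to the document type without checking what the parent actually is. A common defensive pattern is to test with instanceof and, if needed, walk further up the parent chain. The sketch below uses hypothetical stand-in classes (Part, Document); it does not reproduce POI's class hierarchy or the upstream fix.

```java
class ParentLookup {
    /** Hypothetical stand-in for a generic package part with an optional parent. */
    static class Part {
        final Part parent;
        Part(Part parent) { this.parent = parent; }
    }

    /** Hypothetical stand-in for the top-level document part. */
    static class Document extends Part {
        Document() { super(null); }
    }

    /** Walks up the parent chain and returns the first Document found, or null. */
    static Document findDocument(Part part) {
        for (Part p = part; p != null; p = p.parent) {
            if (p instanceof Document) {
                return (Document) p; // safe: the type was checked first
            }
        }
        return null; // no document ancestor; the caller must handle this case
    }

    public static void main(String[] args) {
        Document doc = new Document();
        Part footnotes = new Part(new Part(doc)); // document is a grandparent here
        System.out.println(findDocument(footnotes) == doc);
    }
}
```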
[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template
[ https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharath Kumar updated TIKA-2147:
    Attachment: basicresume.docx

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx, basicresume.docx
[jira] [Commented] (TIKA-2147) ClassCastException on a valid Word template
[ https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615588#comment-15615588 ] Sharath Kumar commented on TIKA-2147: - I get a similar issue with a .docx file too. I have attached a document that reproduces the issue. > ClassCastException on a valid Word template > --- > > Key: TIKA-2147 > URL: https://issues.apache.org/jira/browse/TIKA-2147 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Forefront Fax.dotx > > > On the attached document template, which opens fine in Word, the Tika parser > throws the following error: > java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be > cast to org.apache.poi.xwpf.usermodel.XWPFDocument > at > org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162) > at > org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47) > at > org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95) > at > org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658) > at > org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235) > at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160) > at > org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124) > at > org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58) > at > org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Commented] (TIKA-2149) org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx
[ https://issues.apache.org/jira/browse/TIKA-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615584#comment-15615584 ] Sharath Kumar commented on TIKA-2149: - In TIKA-2147, the input document is a Word template. In my case, however, it is not. > org.apache.poi.POIXMLDocumentPart cannot be cast to > org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx > -- > > Key: TIKA-2149 > URL: https://issues.apache.org/jira/browse/TIKA-2149 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11, 1.13 > Environment: Windows 7 . Linux RHEL 7 >Reporter: Sharath Kumar > > When I run the attached document(.docx) against tika 1.11 or tika 1.13 to > extract contents, it errors out with the below exception > Exception in thread "main" org.apache.tika.exception.TikaException: > Unexpected RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1ea9f6af > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145) > Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart > cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument > at > org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162) > at > org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47) > at > org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95) > at > org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658) > at > org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235) > at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160) > at > 
org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124) > at > org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58) > at > org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 5 more
[jira] [Comment Edited] (TIKA-2149) org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx
[ https://issues.apache.org/jira/browse/TIKA-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615584#comment-15615584 ] Sharath Kumar edited comment on TIKA-2149 at 10/28/16 2:37 PM: --- Bug Tika-2147, the input document is a word template. However not in my case was (Author: mnsk07): Tika 2147, the input document is a word template. However not in my case > org.apache.poi.POIXMLDocumentPart cannot be cast to > org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx > -- > > Key: TIKA-2149 > URL: https://issues.apache.org/jira/browse/TIKA-2149 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11, 1.13 > Environment: Windows 7 . Linux RHEL 7 >Reporter: Sharath Kumar > > When I run the attached document(.docx) against tika 1.11 or tika 1.13 to > extract contents, it errors out with the below exception > Exception in thread "main" org.apache.tika.exception.TikaException: > Unexpected RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1ea9f6af > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145) > Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart > cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument > at > org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162) > at > org.apache.poi.xwpf.usermodel.XWPFFootnote.(XWPFFootnote.java:47) > at > org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95) > at > org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658) > at > 
org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235) > at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160) > at > org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124) > at > org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58) > at > org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 5 more
[jira] [Issue Comment Deleted] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharath Kumar updated TIKA-2146: Comment: was deleted (was: Does tika support extracting the contents of a protected MS-word document. The document is however not a password protected though.) > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc, This is password protected.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > 
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614659#comment-15614659 ] Sharath Kumar commented on TIKA-2146: - Does Tika support extracting the contents of a protected MS Word document? The document is not password protected, though. > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc, This is password protected.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > 
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614660#comment-15614660 ] Sharath Kumar commented on TIKA-2146: - Does Tika support extracting the contents of a protected MS Word document? The document is not password protected, though. > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc, This is password protected.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > 
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > 
at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
[jira] [Created] (TIKA-2149) org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx
Sharath Kumar created TIKA-2149: --- Summary: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx Key: TIKA-2149 URL: https://issues.apache.org/jira/browse/TIKA-2149 Project: Tika Issue Type: Bug Components: core, parser Affects Versions: 1.13, 1.11 Environment: Windows 7 . Linux RHEL 7 Reporter: Sharath Kumar When I run the attached document(.docx) against tika 1.11 or tika 1.13 to extract contents, it errors out with the below exception Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1ea9f6af at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145) Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument at org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162) at org.apache.poi.xwpf.usermodel.XWPFFootnote.(XWPFFootnote.java:47) at org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95) at org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658) at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235) at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160) at org.apache.poi.xwpf.usermodel.XWPFDocument.(XWPFDocument.java:124) at org.apache.poi.xwpf.extractor.XWPFWordExtractor.(XWPFWordExtractor.java:58) at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237) at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86) at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ... 5 more
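The trace in this report follows the same pattern as the others in the thread: the outer TikaException only says "Unexpected RuntimeException", and the informative part (here a ClassCastException raised inside POI) sits in the "Caused by" chain. A minimal sketch, using only the JDK, of how calling code might pull out that innermost cause for logging. The class name RootCause is hypothetical, and the exceptions built in main are stand-ins for illustration, not real Tika or POI objects.

```java
public final class RootCause {
    private RootCause() {}

    /** Follows getCause() links and returns the innermost exception. */
    public static Throwable of(Throwable t) {
        Throwable current = t;
        int depth = 0;
        // The depth limit and self-check guard against cyclic cause chains.
        while (current.getCause() != null
                && current.getCause() != current
                && depth++ < 100) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapper exception; real code would catch
        // whatever the parse call throws and pass it to of().
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser",
                new ClassCastException(
                        "POIXMLDocumentPart cannot be cast to XWPFDocument"));

        Throwable root = of(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging the root cause alongside the file name makes it easier to tell reports such as TIKA-2147 and TIKA-2149 apart from unrelated parser failures that share the same outer message.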
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614367#comment-15614367 ] Sharath Kumar commented on TIKA-2146: - [~talli...@mitre.org] I ran the same document that I attached using Tika 1.13, and I get the issue below even in 1.13. I have one more protected document (MS Word 97), which I can't share because it contains sensitive data; it also results in an error. Below are the error logs. I have a question: does Tika support extracting the contents of a protected MS Word document? The document in question is not password protected, though. Output 1: C:\Users\sk\Downloads>java -jar tika-app-1.13.jar Testbug.doc Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.Offic at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145) Caused by: java.lang.IllegalStateException: Told we're for characters 8236 -> 10293, but actually covers 2055 characters! 
at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73) at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:112) at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70) at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:72) at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:602) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:146) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ... 5 more Output 2: Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f27a732 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145) Caused by: java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:342) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ... 
5 more > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc, This is password protected.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.Doc
[jira] [Comment Edited] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611671#comment-15611671 ] Sharath Kumar edited comment on TIKA-2146 at 10/27/16 12:36 PM: Sure. I have uploaded the doc. The file is not password protected. I also see errors like the below for these type of docs(protected word docs) java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40 at java.security.AccessController.doPrivileged(Native Method) was (Author: mnsk07): Sure. I have uploaded the doc. The file is not password protected. I also see errors like the below for these type of docs java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40 at java.security.AccessController.doPrivileged(Native Method) > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at 
java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > 
org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent
[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharath Kumar updated TIKA-2146: Attachment: Test bug.doc > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > 
org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) > at 
org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611671#comment-15611671 ] Sharath Kumar commented on TIKA-2146: - Sure. I have uploaded the doc. The file is not password protected. I also see errors like the below for these type of docs java.security.PrivilegedActionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40 at java.security.AccessController.doPrivileged(Native Method) > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > Attachments: Test bug.doc > > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > 
org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > 
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(Offi
[jira] [Updated] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sharath Kumar updated TIKA-2146: Component/s: parser > Unable to extract contents from protected MS > word-doc-java.lang.ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2146 > URL: https://issues.apache.org/jira/browse/TIKA-2146 > Project: Tika > Issue Type: Bug > Components: core, parser >Affects Versions: 1.11 > Environment: Windows 7 >Reporter: Sharath Kumar > > When I try to parse a MS word document which is protected, I am unable to > extract the content rather, i get the below exception > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@29402a40 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at org.apache.tika.Tika.parseToString(Tika.java:537) > at > org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102) > at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1) > at java.security.AccessController.doPrivileged(Native Method) > at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99) > at > org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482) > at > org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309) > at > org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436) > at > org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262) > at > org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122) > at > org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309) > at > org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529) > at > org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506) > at > 
org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215) > at > org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191) > at > org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279) > at > org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271) > at > org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75) > at > org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376) > at > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345) > at > 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
Sharath Kumar created TIKA-2146:
---

             Summary: Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException
                 Key: TIKA-2146
                 URL: https://issues.apache.org/jira/browse/TIKA-2146
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 1.11
         Environment: Windows 7
            Reporter: Sharath Kumar

When I try to parse a protected MS Word document, I am unable to extract the content. Instead, I get the exception below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@29402a40
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:537)
    at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
    at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
    at org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
    at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
    at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
    at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
    at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
    at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
    at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
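
In the trace reported for this issue, the ArrayIndexOutOfBoundsException raised in POI's SectionTable constructor reaches the caller wrapped in a TikaException. A caller that wants to log or classify the underlying failure can walk the cause chain. The following is only a minimal sketch in plain Java: the class name RootCauseDemo and the helper rootCause are hypothetical, and a plain Exception stands in for TikaException so that no Tika jar is needed to run it.

```java
public class RootCauseDemo {

    /** Follows getCause() links and returns the innermost throwable. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause, or when a cause points back to itself.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the exceptions in the report (assumption: the real
        // wrapper would be org.apache.tika.exception.TikaException).
        Throwable fromPoi = new ArrayIndexOutOfBoundsException();
        Throwable wrapper = new Exception(
                "Unexpected RuntimeException from OfficeParser", fromPoi);

        // Print the class of the innermost exception.
        System.out.println(rootCause(wrapper).getClass().getName());
    }
}
```

This only helps with diagnosis, for example deciding to skip a document that POI cannot read; it does not make the document parseable.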