[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-28 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614660#comment-15614660
 ] 

Sharath Kumar commented on TIKA-2146:
-

Does Tika support extracting the contents of a protected MS Word document? The 
document is protected, but it is not password-protected.

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document that is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
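
The trace above shows a TikaException wrapping the underlying ArrayIndexOutOfBoundsException thrown inside POI. As a minimal sketch using only the JDK, a caller could walk the cause chain to report the innermost exception. The class and method names here are hypothetical, and a plain RuntimeException stands in for TikaException.

```java
final class RootCauseSketch {
    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop on a null cause, and guard against a self-referential chain.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the wrapped exception pair seen in the trace.
        Throwable inner = new ArrayIndexOutOfBoundsException();
        Throwable outer = new RuntimeException("Unexpected RuntimeException from parser", inner);
        System.out.println(rootCause(outer).getClass().getName());
    }
}
```

Logging the root cause next to the wrapper makes it easier to tell a parser-library failure from a problem in the calling code.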




[jira] [Issue Comment Deleted] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-28 Thread Sharath Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharath Kumar updated TIKA-2146:

Comment: was deleted

(was:  Does tika support extracting the contents of a protected MS-word 
document. The document is however not a password protected though.)

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse an MS Word document that is protected, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)




[jira] [Resolved] (TIKA-2149) org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx

2016-10-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2149.
---
Resolution: Duplicate

>  org.apache.poi.POIXMLDocumentPart cannot be cast to 
> org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx
> --
>
> Key: TIKA-2149
> URL: https://issues.apache.org/jira/browse/TIKA-2149
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11, 1.13
> Environment: Windows 7 . Linux RHEL 7
>Reporter: Sharath Kumar
>
> When I run Tika 1.11 or Tika 1.13 against the attached document (.docx) to 
> extract its contents, it fails with the exception below:
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1ea9f6af
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
> at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
> at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
> at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 5 more
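
The ClassCastException above arises when a footnotes part casts its parent to a document type and the parent turns out to be a more general part. The following is only an illustrative sketch of checking the type before casting. Part, Document and owningDocument are hypothetical stand-ins, not POI classes or the fix that was applied.

```java
final class CastGuardSketch {
    /** Hypothetical stand-in for a generic package part with an optional parent. */
    static class Part {
        private final Part parent;
        Part(Part parent) { this.parent = parent; }
        Part getParent() { return parent; }
    }

    /** Hypothetical stand-in for the top-level document part. */
    static final class Document extends Part {
        Document() { super(null); }
    }

    /** Returns the owning Document, or null when the parent is some other kind of part. */
    static Document owningDocument(Part part) {
        Part parent = part.getParent();
        // An unchecked (Document) cast here would throw ClassCastException for a generic parent.
        return (parent instanceof Document) ? (Document) parent : null;
    }

    public static void main(String[] args) {
        Part underDocument = new Part(new Document());
        Part underGenericPart = new Part(new Part(null));
        System.out.println(owningDocument(underDocument) != null);
        System.out.println(owningDocument(underGenericPart) != null);
    }
}
```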



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2142) ArrayIndexOutOfBoundsException

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615404#comment-15615404
 ] 

Tim Allison commented on TIKA-2142:
---

https://bz.apache.org/bugzilla/show_bug.cgi?id=60305

> ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2142
> URL: https://issues.apache.org/jira/browse/TIKA-2142
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: HPV8dHinge Confocal Results.ppt
>
>
> On the attached PowerPoint presentation, which opens fine with PowerPoint, 
> the Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.readPictures(HSLFSlideShowImpl.java:438)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.getPictureData(HSLFSlideShowImpl.java:772)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShow.getPictureData(HSLFSlideShow.java:547)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:305)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
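
The trace above ends in System.arraycopy, which throws when the requested range runs past the end of either array. As a rough sketch of that failure mode, and of one defensive alternative, the hypothetical helper below clamps the copy to the bytes actually available. It is an assumption for illustration, not the change made in POI.

```java
final class SafeCopySketch {
    /**
     * Copies up to {@code length} bytes from {@code src} starting at {@code offset},
     * clamping to the bytes actually available instead of throwing.
     */
    static byte[] copyAvailable(byte[] src, int offset, int length) {
        if (offset < 0 || length < 0 || offset > src.length) {
            return new byte[0];
        }
        int available = Math.min(length, src.length - offset);
        byte[] dest = new byte[available];
        System.arraycopy(src, offset, dest, 0, available);
        return dest;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        try {
            // A declared length larger than the remaining data triggers the exception.
            System.arraycopy(data, 2, new byte[8], 0, 8);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked copy failed: " + e.getClass().getName());
        }
        System.out.println(copyAvailable(data, 2, 8).length);
    }
}
```

Whether truncating silently is acceptable depends on the caller; a parser may prefer to log the mismatch or skip the damaged record.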





[jira] [Created] (TIKA-2150) RTF TextExtractor omits some content

2016-10-28 Thread T. Schmidt (JIRA)
T. Schmidt created TIKA-2150:


 Summary: RTF TextExtractor omits some content
 Key: TIKA-2150
 URL: https://issues.apache.org/jira/browse/TIKA-2150
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.13
Reporter: T. Schmidt


The TextExtractor class seems to handle the first two content words (TO FROM) 
in the provided file as if they belonged to the header. They are missing from 
the text output.





[jira] [Commented] (TIKA-2142) ArrayIndexOutOfBoundsException

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615448#comment-15615448
 ] 

Tim Allison commented on TIKA-2142:
---

Fixed in POI r1767023

> ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2142
> URL: https://issues.apache.org/jira/browse/TIKA-2142
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: HPV8dHinge Confocal Results.ppt
>
>
> On the attached PowerPoint presentation, which opens fine with PowerPoint, 
> the Tika parser throws the following error:
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.readPictures(HSLFSlideShowImpl.java:438)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.getPictureData(HSLFSlideShowImpl.java:772)
>   at 
> org.apache.poi.hslf.usermodel.HSLFSlideShow.getPictureData(HSLFSlideShow.java:547)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:305)
>   at 
> org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)





[jira] [Updated] (TIKA-2150) RTF TextExtractor omits some content

2016-10-28 Thread T. Schmidt (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

T. Schmidt updated TIKA-2150:
-
Attachment: bi16tabe.000

> RTF TextExtractor omits some content
> 
>
> Key: TIKA-2150
> URL: https://issues.apache.org/jira/browse/TIKA-2150
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: T. Schmidt
> Attachments: bi16tabe.000
>
>
> The TextExtractor class seems to handle the first two content words (TO FROM) 
> in the provided file as if they belonged to the header. They are missing 
> from the text output.





[jira] [Resolved] (TIKA-2144) NullPointerException on a valid Word file

2016-10-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2144.
---
Resolution: Fixed

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2145) InvalidFormatException on a valid Word file

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615530#comment-15615530
 ] 

Tim Allison commented on TIKA-2145:
---

Fixed in POI https://bz.apache.org/bugzilla/show_bug.cgi?id=60315

> InvalidFormatException on a valid Word file
> ---
>
> Key: TIKA-2145
> URL: https://issues.apache.org/jira/browse/TIKA-2145
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: safety_analysis_report_FINAL2.docx
>
>
> On the attached Word file, which opens fine with Word, the Tika parser throws 
> the following exception:
> org.apache.tika.exception.TikaException: Error creating OOXML extractor
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.IllegalArgumentException: Date for created could not be 
> parsed: 2015-07-27
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408)
>   at 
> org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124)
>   at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743)
>   at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69)
>   ... 3 more
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date 
> 2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, 
> yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', 
> yyyy-MM-dd'T'HH:mm:ss.SS'Z'
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615)
>   at 
> org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406)
>   ... 7 more
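
The innermost cause above is a date-only value, 2015-07-27, being checked against date-time patterns. The sketch below, which uses only the JDK, shows why such a value is rejected and how a date-only fallback would accept it. The pattern strings, including the "yyyy" year prefix and the "yyyy-MM-dd" fallback, are assumptions for illustration and not necessarily what POI uses.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class CreatedDateSketch {
    // Assumed date-time patterns, modelled on the formats named in the error message.
    private static final String[] DATE_TIME_PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss.SS'Z'",
    };
    // Assumed fallback for values that carry a date but no time.
    private static final String DATE_ONLY_PATTERN = "yyyy-MM-dd";

    /** Parses value with one pattern in UTC; returns null if it does not match. */
    static Date tryParse(String value, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        try {
            return format.parse(value);
        } catch (ParseException e) {
            return null;
        }
    }

    /** Tries the date-time patterns first, then the date-only fallback. */
    static Date parseCreated(String value) {
        for (String pattern : DATE_TIME_PATTERNS) {
            Date parsed = tryParse(value, pattern);
            if (parsed != null) {
                return parsed;
            }
        }
        return tryParse(value, DATE_ONLY_PATTERN);
    }

    public static void main(String[] args) {
        // The date-only value fails the strict date-time pattern ...
        System.out.println(tryParse("2015-07-27", DATE_TIME_PATTERNS[0]) == null);
        // ... but is accepted once the date-only fallback is tried.
        System.out.println(parseCreated("2015-07-27") != null);
    }
}
```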





tika-2.x-windows - Build # 68 - Still Failing

2016-10-28 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #68)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/68/ to 
view the results.

[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615542#comment-15615542
 ] 

Hudson commented on TIKA-2144:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #68 (See 
[https://builds.apache.org/job/tika-2.x-windows/68/])
TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if (tallison: 
rev 4b393a6f9be5ed492ce4408ff12971bef82b4a14)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java


> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Proposal ID 17 Offeror ChromoLogic.docx
>
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Comment Edited] (TIKA-2149) org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx

2016-10-28 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615584#comment-15615584
 ] 

Sharath Kumar edited comment on TIKA-2149 at 10/28/16 2:37 PM:
---

In bug TIKA-2147, the input document is a Word template. That is not the case for my document.


was (Author: mnsk07):
Tika 2147, the input document is a word template. However not in my case

>  org.apache.poi.POIXMLDocumentPart cannot be cast to 
> org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx
> --
>
> Key: TIKA-2149
> URL: https://issues.apache.org/jira/browse/TIKA-2149
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11, 1.13
> Environment: Windows 7 . Linux RHEL 7
>Reporter: Sharath Kumar
>
> When I run Tika 1.11 or Tika 1.13 against the attached document (.docx) to 
> extract its contents, it fails with the exception below:
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1ea9f6af
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
> at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
> at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
> at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 5 more





[jira] [Commented] (TIKA-2149) org.apache.poi.POIXMLDocumentPart cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx

2016-10-28 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615584#comment-15615584
 ] 

Sharath Kumar commented on TIKA-2149:
-

Tika 2147, the input document is a word template. However not in my case

>  org.apache.poi.POIXMLDocumentPart cannot be cast to 
> org.apache.poi.xwpf.usermodel.XWPFDocument - MS Word docx
> --
>
> Key: TIKA-2149
> URL: https://issues.apache.org/jira/browse/TIKA-2149
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11, 1.13
> Environment: Windows 7 . Linux RHEL 7
>Reporter: Sharath Kumar
>
> When I run Tika 1.11 or Tika 1.13 against the attached document (.docx) to 
> extract its contents, it fails with the exception below:
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1ea9f6af
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> Caused by: java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart 
> cannot be cast to org.apache.poi.xwpf.usermodel.XWPFDocument
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
> at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
> at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
> at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
> at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 5 more





[jira] [Commented] (TIKA-2147) ClassCastException on a valid Word template

2016-10-28 Thread Sharath Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615588#comment-15615588
 ] 

Sharath Kumar commented on TIKA-2147:
-

I get a similar issue for docx too. I have attached a document that 
reproduces the issue.

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2147) ClassCastException on a valid Word template

2016-10-28 Thread Sharath Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharath Kumar updated TIKA-2147:

Attachment: basicresume.docx

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx, basicresume.docx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2150) RTF TextExtractor omits some content

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615786#comment-15615786
 ] 

Tim Allison commented on TIKA-2150:
---

Thank you for opening this and submitting a minimal file and even diagnosing 
the problem!  I'm not yet sure how best to fix this. We rely on "texty" signals 
("par", etc.) to determine that we're no longer in the header.  I worry that going 
for a stricter parse will have unintended consequences.  I'll dig some more.  
Thank you, again.
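
The kind of heuristic described above can be illustrated with a toy sketch. This is not Tika's actual TextExtractor code: the class name, the token model (control words are strings starting with a backslash), and the set of "texty" control words are all assumptions made for illustration. It shows how text that appears before the first such signal would be treated as header material and dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of Tika. */
final class HeaderHeuristicSketch {
    // Assumed set of "texty" control words; a real parser's list may differ.
    private static final Set<String> TEXTY =
            new HashSet<>(Arrays.asList("par", "pard"));

    private HeaderHeuristicSketch() {}

    /** Returns the text tokens seen after the first "texty" control word. */
    static List<String> bodyText(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean inHeader = true;
        for (String tok : tokens) {
            if (tok.startsWith("\\")) {
                // A control word: leave header mode on the first texty signal.
                if (TEXTY.contains(tok.substring(1))) {
                    inHeader = false;
                }
            } else if (!inHeader) {
                out.add(tok);
            }
            // Text tokens seen while still "in the header" are silently dropped.
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "\\rtf1", "\\ansi", "TO", "FROM", "\\par", "Subject");
        System.out.println(bodyText(tokens));
    }
}
```

With a heuristic of this shape, words that precede the first signal never reach the output, which matches the symptom reported in this issue.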

> RTF TextExtractor omits some content
> 
>
> Key: TIKA-2150
> URL: https://issues.apache.org/jira/browse/TIKA-2150
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: T. Schmidt
> Attachments: bi16tabe.000
>
>
> The TextExtractor class seems to handle the first two content words (TO FROM) 
> in the provided file as if they belonged to the header. They are missing 
> from the text output.





[jira] [Commented] (TIKA-2147) ClassCastException on a valid Word template

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615862#comment-15615862
 ] 

Tim Allison commented on TIKA-2147:
---

https://bz.apache.org/bugzilla/show_bug.cgi?id=60316

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx, basicresume.docx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2147) ClassCastException on a valid Word template

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615870#comment-15615870
 ] 

Tim Allison commented on TIKA-2147:
---

Great.  Thank you.  My proposed fix works on both docs.  Will wait for feedback 
from POI colleagues before committing fix in POI.

> ClassCastException on a valid Word template
> ---
>
> Key: TIKA-2147
> URL: https://issues.apache.org/jira/browse/TIKA-2147
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: Forefront Fax.dotx, basicresume.docx
>
>
> On the attached document template, which opens fine in Word, the Tika parser 
> throws the following error:
> java.lang.ClassCastException: org.apache.poi.POIXMLDocumentPart cannot be 
> cast to org.apache.poi.xwpf.usermodel.XWPFDocument
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.getXWPFDocument(XWPFFootnotes.java:162)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnote.<init>(XWPFFootnote.java:47)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFFootnotes.onDocumentRead(XWPFFootnotes.java:95)
>   at 
> org.apache.poi.POIXMLDocumentPart._invokeOnDocumentRead(POIXMLDocumentPart.java:658)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:235)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Updated] (TIKA-2144) NullPointerException on a valid Word file

2016-10-28 Thread Seva Alekseyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seva Alekseyev updated TIKA-2144:
-
Attachment: (was: Proposal ID 17 Offeror ChromoLogic.docx)

> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615990#comment-15615990
 ] 

Hudson commented on TIKA-2144:
--

FAILURE: Integrated in Jenkins build tika-2.x #166 (See 
[https://builds.apache.org/job/tika-2.x/166/])
TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if (tallison: 
rev 4b393a6f9be5ed492ce4408ff12971bef82b4a14)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
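
The commit message above mentions avoiding an NPE when the styles do not exist. A minimal sketch of that kind of guard follows. The `Styles` interface and the `styleName` helper are hypothetical stand-ins, not the actual POI or Tika types, and the real change in XWPFWordExtractorDecorator may look different.

```java
/** Hypothetical illustration only; not the actual Tika change. */
final class StyleLookupSketch {
    /** Stand-in for a document's style table, which may be absent. */
    interface Styles {
        String nameFor(String styleId);
    }

    private StyleLookupSketch() {}

    /** Returns the style name, or null if the style table or the id is missing. */
    static String styleName(Styles styles, String styleId) {
        // Guard first: a document can lack a style table entirely.
        if (styles == null || styleId == null) {
            return null;
        }
        return styles.nameFor(styleId);
    }

    public static void main(String[] args) {
        System.out.println(styleName(null, "Heading1"));
        System.out.println(styleName(id -> "Title", "Heading1"));
    }
}
```

Callers then treat a null result as "no style information" instead of failing the whole parse.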


> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





tika-2.x - Build # 166 - Failure

2016-10-28 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x (build #166)

Status: Failure

Check console output at https://builds.apache.org/job/tika-2.x/166/ to view the 
results.

[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616001#comment-15616001
 ] 

Hudson commented on TIKA-2144:
--

FAILURE: Integrated in Jenkins build Tika-trunk #1128 (See 
[https://builds.apache.org/job/Tika-trunk/1128/])
TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if (tallison: 
rev 01163e23cc9d1701e4a23f6cb13771a31aa99f08)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java


> NullPointerException on a valid Word file
> -
>
> Key: TIKA-2144
> URL: https://issues.apache.org/jira/browse/TIKA-2144
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> On the attached Word file, which opens fine in Word, the Tika parser throws 
> the following error:
> java.lang.NullPointerException
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:149)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>   at 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>   at 
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)





[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616180#comment-15616180
 ] 

Tim Allison commented on TIKA-2146:
---

I wonder if these errors are caused by what I found with old "protected" Excel 
files.  Even though they weren't password protected, they were still 
"protected", and the inner objects were encrypted to the point that even the 
record lengths were unreadable, leading to ArrayIndexOutOfBoundsExceptions 
(AIOOBE) and similar problems.
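
In the trace quoted below, the ArrayIndexOutOfBoundsException arrives wrapped in an outer exception. A small sketch of how calling code might walk the cause chain to find the innermost exception follows. It uses only the JDK; the class and method names are hypothetical and are not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper for inspecting wrapped exceptions. */
final class CauseChain {
    private CauseChain() {}

    /** Returns the exception followed by its causes, outermost first. */
    static List<Throwable> of(Throwable t) {
        Objects.requireNonNull(t, "t");
        List<Throwable> chain = new ArrayList<>();
        // Stop at the end of the chain, or if a cause repeats (a cycle).
        while (t != null && !chain.contains(t)) {
            chain.add(t);
            t = t.getCause();
        }
        return chain;
    }

    /** Returns the innermost cause, or the exception itself if it has none. */
    static Throwable root(Throwable t) {
        List<Throwable> chain = of(t);
        return chain.get(chain.size() - 1);
    }

    public static void main(String[] args) {
        // Stand-in for the wrapper seen in the trace; not a real TikaException.
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser",
                new ArrayIndexOutOfBoundsException());
        System.out.println(root(wrapped).getClass().getName());
    }
}
```

A caller could use a check like `root(e) instanceof ArrayIndexOutOfBoundsException` to log such files separately from other parse failures.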

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse a protected MS Word document, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> 

[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-28 Thread Frank Refol (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616341#comment-15616341
 ] 

Frank Refol commented on TIKA-2146:
---

Thanks for clarifying and providing that link. That is very helpful in giving 
insight on what is available in Tika with decrypting MS Office docs.

> Unable to extract contents from protected MS 
> word-doc-java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: TIKA-2146
> URL: https://issues.apache.org/jira/browse/TIKA-2146
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.11
> Environment: Windows 7
>Reporter: Sharath Kumar
> Attachments: Test bug.doc, This is password protected.doc
>
>
> When I try to parse a protected MS Word document, I am unable to 
> extract the content; instead, I get the exception below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@29402a40
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:537)
>   at 
> org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:102)
>   at org.elasticsearch.mapper.attachments.TikaImpl$1.run(TikaImpl.java:1)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.elasticsearch.mapper.attachments.TikaImpl.parse(TikaImpl.java:99)
>   at 
> org.elasticsearch.mapper.attachments.AttachmentMapper.parse(AttachmentMapper.java:482)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:309)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:436)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:262)
>   at 
> org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:122)
>   at 
> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
>   at 
> org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:215)
>   at 
> org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:224)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardUpdateOperation(TransportShardBulkAction.java:389)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:191)
>   at 
> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
>   at 
> org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
>   at 
> org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
>   at 
> org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
>   at 
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.poi.hwpf.model.SectionTable.<init>(SectionTable.java:84)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:345)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280