[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107067#comment-15107067
 ] 

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:57 PM:
------------------------------------------------------------

I concur with Ken: if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that POI should probably catch and log this rather than let it block 
extraction of the entire document.  Would you mind opening an issue in POI's 
Bugzilla and adding a link to this issue?  I'll see what I can do, but it 
probably won't be until early next week, and then we'll have to wait for the 
next version of POI before we'll see different behavior in Tika.
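
To illustrate the catch-and-log idea, here is a minimal, self-contained sketch. 
It does not use POI's actual classes: LenientSavedByDemo, TableReader, Doc and 
load are hypothetical names standing in for the optional saved-by table read 
that fails in the stack trace, and the real fix in POI may look quite different.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LenientSavedByDemo {
    private static final Logger LOG =
            Logger.getLogger(LenientSavedByDemo.class.getName());

    /** Hypothetical stand-in for a reader of an optional metadata table. */
    interface TableReader {
        String read();
    }

    /** Hypothetical document holder: body text plus optional saved-by data. */
    static final class Doc {
        final String text;
        final String savedBy; // null when the table could not be read

        Doc(String text, String savedBy) {
            this.text = text;
            this.savedBy = savedBy;
        }
    }

    /**
     * Builds a Doc, treating the saved-by table as optional: an unsupported
     * table format is logged and skipped instead of aborting the whole load.
     */
    static Doc load(String text, TableReader savedByReader) {
        String savedBy = null;
        try {
            savedBy = savedByReader.read();
        } catch (UnsupportedOperationException e) {
            LOG.log(Level.WARNING, "Skipping unreadable saved-by table", e);
        }
        return new Doc(text, savedBy);
    }

    public static void main(String[] args) {
        // Simulate the failure from the stack trace; the body text survives.
        Doc doc = load("body text", () -> {
            throw new UnsupportedOperationException(
                    "Non-extended character Pascal strings are not supported");
        });
        System.out.println(doc.text);
    }
}
```

The point of the design is that only the optional metadata is lost; the 
document text is still returned to the caller.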

-Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.-

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039



> Conversion DOC->TXT failed due to POI issue
> -------------------------------------------
>
>                 Key: TIKA-1836
>                 URL: https://issues.apache.org/jira/browse/TIKA-1836
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:      12.04
> Codename:     precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>            Reporter: Jorge Spinsanti
>         Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
>       at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>       at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>       at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>       at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>       at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>       at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 22 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
