[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216679#comment-15216679 ]

Tim Allison edited comment on TIKA-1836 at 3/29/16 7:16 PM:

No problem. 1.12 was cut in January before the upgrade in POI. This is fixed in trunk/1.13...I just confirmed. I didn't add a test because the test file at 20kb felt too big for the rarity (presumed) of this issue. I can add a test in Tika if we want it.

was (Author: talli...@mitre.org):
No problem. 1.12 was cut in January before we the upgrade in POI. This is fixed in trunk/1.13...I just confirmed. I didn't add a test because the test file at 20kb felt too big for the rarity (presumed) of this issue. I can add a test in Tika if we want it.

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment: Distributor ID: Ubuntu
> Description: Ubuntu 12.04.5 LTS
> Release: 12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
> Reporter: Jorge Spinsanti
> Fix For: 1.13
>
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
> at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
> at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
> at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
> at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
> at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
> at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 22 more
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
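The trace in the issue shows the POI `UnsupportedOperationException` arriving wrapped inside a `TikaException`. A caller that wants to recognise this particular failure could walk the cause chain. The following is only a minimal sketch using the JDK alone: plain exceptions stand in for the Tika and POI types, the class and method names are hypothetical, and matching on the message text is an assumption that is brittle across POI versions.

```java
/** Hypothetical helper; not part of Tika or POI. */
final class RootCauseSketch {

    /** Walks the cause chain and returns the innermost throwable. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Returns true when the innermost cause looks like the failure reported
     * in the issue. The message fragment is copied from the stack trace;
     * relying on it is an assumption, not a documented contract.
     */
    static boolean isUnsupportedPascalString(Throwable t) {
        Throwable root = rootCause(t);
        return root instanceof UnsupportedOperationException
                && root.getMessage() != null
                && root.getMessage().contains("Pascal strings are not supported");
    }

    public static void main(String[] args) {
        // Stand-ins for the TikaException and its POI cause from the trace.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from parser",
                new UnsupportedOperationException(
                        "Non-extended character Pascal strings are not supported right now."));
        System.out.println(isUnsupportedPascalString(wrapped)); // prints "true"
    }
}
```

In real code the outer exception would be the `TikaException` thrown by the parser; the check only identifies the failure and does not recover any text from the document.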
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216679#comment-15216679 ]

Tim Allison edited comment on TIKA-1836 at 3/29/16 7:15 PM:

No problem. 1.12 was cut in January before we the upgrade in POI. This is fixed in trunk/1.13...I just confirmed. I didn't add a test because the test file at 20kb felt too big for the rarity (presumed) of this issue. I can add a test in Tika if we want it.

was (Author: talli...@mitre.org):
No problem. 1.12 was cut in January. This is fixed in trunk/1.13...I just confirmed. I didn't add a test because the test file at 20kb felt too big for the rarity (presumed) of this issue. I can add a test in Tika if we want it.
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107212#comment-15107212 ]

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:08 PM:

The POI issue was reported on 2014-08-22. Perhaps if TIKA (another Apache project) needs the fix, the TIKA team can push to increase its priority/importance.

was (Author: giorgy):
POI issue was report in 2014-08-22. Perhaps if TIKA needs the fix, TIKA team can push to increase the priority/importance.
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106919#comment-15106919 ]

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:04 PM:

POI is a dependency of TIKA. I think TIKA could evaluate migrating to a newer version of POI. Or perhaps TIKA can manage this issue by trying an alternative approach.

was (Author: giorgy):
POI is a dependency of TIKA. I think TIKA can be evaluate to migrate the use of POI to new version. Or perhaps, TIKA can be manage this issue trying an alternative idea.
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107067#comment-15107067 ]

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:57 PM:

I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document.

Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

-Or, has this already been fixed in POI [1]? If so, we'll be updating soon (TIKA-1799) once the transfer to git has finished.-

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039

was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document.

Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]? If so, we'll be updating soon (TIKA-1799) once the transfer to git has finished.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039
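The suggestion in this comment, to catch and log the failure rather than abort extraction of the whole document, can be illustrated with a small sketch. It is not POI's code: `LenientDocumentSketch` and `OptionalTableReader` are hypothetical stand-ins, and the assumption is that the table that fails to parse (the "saved by" table in the trace) is optional metadata the main text does not depend on.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical document model; illustrates catch-and-log only. */
final class LenientDocumentSketch {
    private static final Logger LOG =
            Logger.getLogger(LenientDocumentSketch.class.getName());

    /** Stand-in for a reader of an optional metadata table. */
    interface OptionalTableReader {
        String read();
    }

    final String text;
    final String savedBy; // null when the optional table could not be read

    LenientDocumentSketch(String text, OptionalTableReader savedByReader) {
        this.text = text;
        String table = null;
        try {
            table = savedByReader.read();
        } catch (UnsupportedOperationException e) {
            // Log and continue: the body text is still usable without this table.
            LOG.log(Level.WARNING, "Skipping unreadable optional table", e);
        }
        this.savedBy = table;
    }

    public static void main(String[] args) {
        LenientDocumentSketch doc = new LenientDocumentSketch(
                "document body",
                () -> {
                    throw new UnsupportedOperationException(
                            "Non-extended character Pascal strings are not supported right now.");
                });
        // The warning is logged, but the text is still available.
        System.out.println(doc.text);
    }
}
```

The trade-off in this design is that a caller can no longer tell from an exception that metadata was dropped, so the log entry (or a nullable field, as here) is the only signal.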
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107067#comment-15107067 ]

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:50 PM:

I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document.

Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]? If so, we'll be updating soon (TIKA-1799) once the transfer to git has finished.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039

was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document.

Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E