[jira] [Created] (TIKA-1835) LinkContentHandler skips iframe and rel tags
Markus Jelsma created TIKA-1835:
-----------------------------------

             Summary: LinkContentHandler skips iframe and rel tags
                 Key: TIKA-1835
                 URL: https://issues.apache.org/jira/browse/TIKA-1835
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 1.11
            Reporter: Markus Jelsma
             Fix For: 1.12

As simple as it gets, link and iframe tags were never implemented in LinkContentHandler. NUTCH-1233 kind of requires it.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
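To illustrate what handling these two tags could look like, here is a minimal SAX sketch that uses only the JDK. It is not Tika's LinkContentHandler: the class name SimpleLinkCollector, the "tag:target" string format, and the choice of attributes (href for a and link, src for iframe) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not Tika code): records targets of a, link and iframe elements. */
public class SimpleLinkCollector extends DefaultHandler {

    private final List<String> targets = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to the qualified name when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String target = null;
        if ("a".equals(name) || "link".equals(name)) {
            target = atts.getValue("", "href");
        } else if ("iframe".equals(name)) {
            target = atts.getValue("", "src");
        }
        if (target != null) {
            targets.add(name + ":" + target);
        }
    }

    /** Returns the collected entries in document order, formatted as "tag:target". */
    public List<String> getTargets() {
        return targets;
    }
}
```

The handler can be driven by any SAX source. Elements without the expected attribute are skipped.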
[jira] [Updated] (TIKA-1823) Support detecting DWF format
[ https://issues.apache.org/jira/browse/TIKA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Moretti updated TIKA-1823:
-------------------------------
    Attachment: blocks_and_tables.dwf

I found this file on the Autodesk website; it could serve as a suitably licensed sample. It is available at the following location:
https://knowledge.autodesk.com/support/autocad/downloads/caas/downloads/content/autocad-sample-files.html

> Support detecting DWF format
> ----------------------------
>
>                 Key: TIKA-1823
>                 URL: https://issues.apache.org/jira/browse/TIKA-1823
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, mime
>            Reporter: Luca Moretti
>            Priority: Minor
>              Labels: detection, dwf, mime
>         Attachments: blocks_and_tables.dwf
>
>
> Tika currently detects dwf files as application/octet-stream.
> To make Tika's mime magic detector recognize dwf files correctly, this code fragment should be added to the _tika-mimetypes.xml_ registry:
> {code:xml}
> dwf
> <_comment>Design Web Format
> {code}
> \\
> In the current version (DWF 6.0), a dwf file is a ZIP-compressed container for vector-based CAD drawings. It is basically a ZIP archive with the _(DWF V06.00)_ signature added before the regular ZIP magic number. For this reason, the match value to detect dwf files should be: {{(DWF V06.00)PK}}.
> In previous versions, the dwf data transport isn't a ZIP file format, so the magic number is only the _(DWF V00.55)_ signature in the file header.
> To make Tika detect dwf files of these versions too, I propose the match value in the code above.
> Thanks,
> Luca
> \\
> P.S.: The DWF format specification is included in the DWF Toolkit. The DWF Toolkit is available for free at [http://www.autodesk.com/dwftoolkit]
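The signature check described in this issue can be sketched in plain Java. This is only an illustration and not Tika's detector: the class DwfSniffer and its methods are hypothetical names, and the header layout "(DWF V" + "dd.dd" + ")" is assumed from the two signatures quoted above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): checks a file header for a DWF signature. */
public class DwfSniffer {

    private static final byte[] PREFIX = "(DWF V".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the header starts with "(DWF Vdd.dd)", for example
     * "(DWF V06.00)" or "(DWF V00.55)". The version digits are not interpreted.
     */
    public static boolean looksLikeDwf(byte[] header) {
        if (header == null || header.length < 12) {
            return false;
        }
        for (int i = 0; i < PREFIX.length; i++) {
            if (header[i] != PREFIX[i]) {
                return false;
            }
        }
        // Offsets 6-7 and 9-10 hold digits, offset 8 the dot, offset 11 the closing parenthesis.
        return isDigit(header[6]) && isDigit(header[7])
                && header[8] == '.'
                && isDigit(header[9]) && isDigit(header[10])
                && header[11] == ')';
    }

    /** Returns true if the signature is immediately followed by the ZIP magic "PK" (DWF 6.0 layout). */
    public static boolean isZipBasedDwf(byte[] header) {
        return looksLikeDwf(header)
                && header.length >= 14
                && header[12] == 'P' && header[13] == 'K';
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }
}
```

Matching the version digits generically, rather than hard-coding V06.00, is a design choice of this sketch so that both the older and the ZIP-based layouts pass the first check.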
[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107368#comment-15107368 ]

Tim Allison commented on TIKA-1799:
-----------------------------------

[~kiwiwings], looks like we have to specify packages after *.office, word, powerpoint, etc. The bundle build works in Tika with just powerpoint and word set to optional. Should we add visio, excel, etc.?

> Upgrade to POI 3.14-Beta1 when available
> ----------------------------------------
>
>                 Key: TIKA-1799
>                 URL: https://issues.apache.org/jira/browse/TIKA-1799
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 349008.ppt, 349008.ppt.json
>
>
> Should be out in the next week or two.
[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107395#comment-15107395 ]

Bob Paulin commented on TIKA-1799:
----------------------------------

Actually I'd be careful using the wildcard here because I think poi-ooxml-schemas provides the visio, office and excel packages. So I don't think they should be optional.
[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107392#comment-15107392 ]

Bob Paulin commented on TIKA-1799:
----------------------------------

So it's actually a pretty interesting question. If you wanted to make all the subpackages of com.microsoft.schemas.office optional you should be able to do:
{code}
com.microsoft.schemas.office.*;resolution:=optional,
{code}
All of these settings are based on BND http://www.aqute.biz/Bnd/Bnd . Not sure we want to wildcard in this case but I believe that would also work. All things equal I prefer explicitly listing optional packages.
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107216#comment-15107216 ]

Tim Allison commented on TIKA-1836:
-----------------------------------

Y, done. I asked POI colleagues if they minded if we logged this instead of throwing an exception. If there are no dissenting opinions, I'll make the change in POI early next week.

> Convertion DOC->TXT failed due to POI issue
> -------------------------------------------
>
>                 Key: TIKA-1836
>                 URL: https://issues.apache.org/jira/browse/TIKA-1836
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Distributor ID: Ubuntu
> Description: Ubuntu 12.04.5 LTS
> Release: 12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>            Reporter: Jorge Spinsanti
>         Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
>     at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>     at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>     at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>     at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>     at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 22 more
> {code}
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107067#comment-15107067 ]

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:57 PM:
------------------------------------------------------------

I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document. Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

-Or, has this already been fixed in POI [1]? If so, we'll be updating soon (TIKA-1799) once the transfer to git has finished.-

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039

was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document. Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]? If so, we'll be updating soon (TIKA-1799) once the transfer to git has finished.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107080#comment-15107080 ]

Tim Allison commented on TIKA-1836:
-----------------------------------

Not already fixed in POI: this is still open: https://bz.apache.org/bugzilla/show_bug.cgi?id=56880
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107067#comment-15107067 ]

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:50 PM:
------------------------------------------------------------

I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document. Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]? If so, we'll be updating soon (TIKA-1799) once the transfer to git has finished.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039

was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document. Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107077#comment-15107077 ]

Tim Allison commented on TIKA-1799:
-----------------------------------

[~bobpaulin], I hate to bother you with this, but do you have any recommendations for the bundling issues we're seeing? Andi and Dominik have both taken a look [0].

Working integration (well non-working integration :) ) is here: https://github.com/tballison/tika/tree/poi-3_14_beta1

[0] http://mail-archives.apache.org/mod_mbox/poi-dev/201601.mbox/%3cby2pr09mb112b38091d6fef30cc59311c7...@by2pr09mb112.namprd09.prod.outlook.com%3e
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107067#comment-15107067 ]

Tim Allison commented on TIKA-1836:
-----------------------------------

I concur with Ken, if I understand this correctly, we can't do anything at the Tika level to prevent this from happening. I also agree with Phil [0], though, that we should probably catch and log this in POI rather than preventing the extraction from the entire document. Mind opening an issue in POI's bugzilla and add a link to this issue? I'll see what I can do...prob won't be until early next week, and then we'll have to wait for the next version of POI before we'll see different behavior in Tika.

[0] https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E
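The "catch and log" idea from this comment can be sketched generically as follows. This is not POI code and not the change that was eventually made there: OptionalPartReader and readOrNull are hypothetical names, and java.util.logging stands in for whatever logger the library actually uses.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper (not POI code): reads a non-essential document part without failing the whole parse. */
public class OptionalPartReader {

    private static final Logger LOG = Logger.getLogger(OptionalPartReader.class.getName());

    /**
     * Runs the reader and returns its result. If the part uses an unsupported
     * encoding, the problem is logged and null is returned so that extraction
     * of the rest of the document can continue.
     */
    public static <T> T readOrNull(String partName, Supplier<T> reader) {
        try {
            return reader.get();
        } catch (UnsupportedOperationException e) {
            LOG.log(Level.WARNING, "Skipping unsupported part: " + partName, e);
            return null;
        }
    }
}
```

Callers then have to tolerate a null part, which is the trade-off of this approach: the document text is still extracted, but the metadata held in the skipped part is silently absent apart from the log entry.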
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107221#comment-15107221 ]

Tim Allison commented on TIKA-1836:
-----------------------------------

The better solution of course would be to add proper parsing for these types of currently unsupported fields. Any interest in submitting a patch over on POI-56880? :)
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107212#comment-15107212 ]

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:08 PM:
----------------------------------------------------------------

The POI issue was reported on 2014-08-22. Perhaps if TIKA (another Apache project) needs the fix, the TIKA team can push to increase its priority/importance.

was (Author: giorgy):
POI issue was report in 2014-08-22. Perhaps if TIKA needs the fix, TIKA team can push to increase the priority/importance.
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107212#comment-15107212 ]

Jorge Spinsanti commented on TIKA-1836:
---------------------------------------

The POI issue was reported on 2014-08-22. Perhaps if TIKA needs the fix, the TIKA team can push to increase its priority/importance.
[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106919#comment-15106919 ]

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:04 PM:
----------------------------------------------------------------

POI is a dependency of TIKA. I think TIKA can evaluate migrating to a new version of POI. Or perhaps TIKA can manage this issue by trying an alternative approach.

was (Author: giorgy):
POI is a dependency of TIKA. I think TIKA can be evaluate to migrate the use of POI to new version. Or perhaps, TIKA can be manage this issue trying an alternative idea.
[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107690#comment-15107690 ]

Andreas Beeker commented on TIKA-1799:
--------------------------------------

I have no idea how OSGi bundling works, but adding the sub-packages (if the base package approach doesn't work) was the recommendation in my original mail [1].

I don't know what I should recommend here. Why were only powerpoint and word optional originally? What is the effect of providing packages via poi-ooxml-schemas and marking them as optional?

[1] http://mail-archives.apache.org/mod_mbox/poi-dev/201601.mbox/%3c568d4ee1.7030...@apache.org%3E
[jira] [Created] (TIKA-1837) HtmlEncodingDetector wrongly detects charset from commented meta
Pascal Essiembre created TIKA-1837:
--------------------------------------

             Summary: HtmlEncodingDetector wrongly detects charset from commented meta
                 Key: TIKA-1837
                 URL: https://issues.apache.org/jira/browse/TIKA-1837
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.11
         Environment: Any.
            Reporter: Pascal Essiembre
            Priority: Minor

The org.apache.tika.parser.html.HtmlEncodingDetector class will grab the first meta tag that has a charset in it matching the pattern defined in HTTP_META_PATTERN. The problem arises when there are multiple such meta tags but the first ones are commented out. In my mind the detector should not consider commented code for this detection. Real example encountered in an HTML page:
{code:xml}
{code}
The detector currently detects {{ISO-8859-1}} while it should detect {{utf-8}}.

*Fix:* Rather than modifying the meta-detection regex, I recommend first stripping comments, keeping in mind that the substring read from the input stream may not contain the closing characters {{-->}}. This has been tested to work:
{code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
// START FIX:
head = head.replaceAll("<!--.*?(-->|$)", "");
// END FIX
Matcher equiv = HTTP_META_PATTERN.matcher(head);
{code}
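The proposed comment stripping can be tried in isolation with the sketch below. It is not the actual HtmlEncodingDetector code: CommentAwareCharsetSniffer is a hypothetical class, META_CHARSET is a simplified stand-in for Tika's HTTP_META_PATTERN (the real pattern differs), and the DOTALL flag is an addition of this sketch so that comments spanning several lines are removed as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only (not Tika code): ignores commented-out meta tags when sniffing a charset. */
public class CommentAwareCharsetSniffer {

    // Simplified, assumed stand-in for HTTP_META_PATTERN.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Matches a comment up to its closing "-->" or, if the buffer was cut off
    // inside the comment, up to the end of the input.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?(-->|$)", Pattern.DOTALL);

    /** Returns the first charset declared in a non-commented meta tag, or null if none is found. */
    public static String sniff(String head) {
        String withoutComments = COMMENT.matcher(head).replaceAll("");
        Matcher m = META_CHARSET.matcher(withoutComments);
        return m.find() ? m.group(1) : null;
    }
}
```

Stripping comments first keeps the meta pattern itself unchanged, which matches the approach recommended in the issue. The `|$` alternative covers the case where only the first bytes of the page were read and the comment is still open at the end of the buffer.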
[jira] [Commented] (TIKA-1833) NoClassDefFoundError for POIXMLTypeLoader
[ https://issues.apache.org/jira/browse/TIKA-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106723#comment-15106723 ] Tim Allison commented on TIKA-1833: --- Ha. Ok. Great to hear. It doesn't surprise me that there might yet be surprises, but this one was surprising. :) Let us know when you find anything else that is curious, and happy extraction! > NoClassDefFoundError for POIXMLTypeLoader > - > > Key: TIKA-1833 > URL: https://issues.apache.org/jira/browse/TIKA-1833 > Project: Tika > Issue Type: Bug >Reporter: Mohammed Manna > > I downloaded tika-app-1.11.jar which has all the necessary dependencies > (checked using 7zip opener and checked the classes). I tried to parse .doc, > .docx files for my project, but it is throwing an error (not an exception). The > stack trace is as follows: > java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader > at > org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown > Source) > at > org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158) > at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167) > at > org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119) > at > org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59) > at > org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at xxx.xxx.xxx.xxx.xAttachmentWithTika(xxxService.java:792) > I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a > known issue? 
Could someone please respond to me? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1835) LinkContentHandler skips iframe and rel tags
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1835: Flags: Patch,Important (was: Important) > LinkContentHandler skips iframe and rel tags > > > Key: TIKA-1835 > URL: https://issues.apache.org/jira/browse/TIKA-1835 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.11 >Reporter: Markus Jelsma > Fix For: 1.12 > > > As simple as it gets, link and iframe tags were never implemented in > LinkContentHandler. NUTCH-1233 kind of requires it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1835) LinkContentHandler skips iframe and rel tags
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated TIKA-1835: Attachment: TIKA-1835.patch Patch for trunk. Adds support for iframe and link element link extraction. Tests included. > LinkContentHandler skips iframe and rel tags > > > Key: TIKA-1835 > URL: https://issues.apache.org/jira/browse/TIKA-1835 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.11 >Reporter: Markus Jelsma > Fix For: 1.12 > > Attachments: TIKA-1835.patch > > > As simple as it gets, link and iframe tags were never implemented in > LinkContentHandler. NUTCH-1233 kind of requires it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
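To make the TIKA-1835 request concrete, here is a minimal sketch of a SAX handler that records links from link and iframe elements alongside a and img. It is not the attached patch and not Tika's LinkContentHandler; the class name SimpleLinkCollector, the "element:target" output format, and the sample markup are assumptions for illustration, and only JDK classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SimpleLinkCollector extends DefaultHandler {

    // Collected links in document order, formatted as "element:target".
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = qName.toLowerCase(Locale.ROOT);
        switch (name) {
            case "a":
            case "link":      // e.g. <link rel="stylesheet" href="...">
                record(name, atts.getValue("href"));
                break;
            case "img":
            case "iframe":    // e.g. <iframe src="...">
                record(name, atts.getValue("src"));
                break;
            default:
                break;
        }
    }

    private void record(String element, String target) {
        if (target != null && !target.isEmpty()) {
            links.add(element + ":" + target);
        }
    }

    /** Parses well-formed XHTML and returns the collected links. */
    public static List<String> collect(String xhtml) {
        SimpleLinkCollector handler = new SimpleLinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.links;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><link rel=\"stylesheet\" href=\"style.css\"/></head>"
                + "<body><a href=\"page.html\">x</a><iframe src=\"frame.html\"></iframe></body></html>";
        System.out.println(collect(xhtml));
    }
}
```

The sketch requires well-formed XHTML because it uses the JDK's XML parser directly; in Tika the handler would instead receive SAX events from the HTML parser.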
[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106752#comment-15106752 ] Tim Allison commented on TIKA-1824: --- Thank you, [~bobpaulin]! Again, this is fantastic. I should have a chance to take a look later today. [~chrismattmann], [~gagravarr], [~kkrugler], [~lewismc],[~rgauss] or others, any feedback on this massive refactoring? > Tika 2.0 - Create Initial Parser Modules > - > > Key: TIKA-1824 > URL: https://issues.apache.org/jira/browse/TIKA-1824 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Bob Paulin >Assignee: Bob Paulin > > Create initial break down of parser modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
Jorge Spinsanti created TIKA-1836: - Summary: Convertion DOC->TXT failed due to POI issue Key: TIKA-1836 URL: https://issues.apache.org/jira/browse/TIKA-1836 Project: Tika Issue Type: Bug Affects Versions: 1.11 Environment: Distributor ID: Ubuntu Description: Ubuntu 12.04.5 LTS Release: 12.04 Codename: precise java version "1.7.0_91" OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1) OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) Reporter: Jorge Spinsanti When we try to convert DOC -> TXT, I got the next stack trace: {code} Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ... 15 more Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update. at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82) at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61) at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52) at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53) at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ... 22 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Spinsanti updated TIKA-1836: -- Component/s: parser > Convertion DOC->TXT failed due to POI issue > --- > > Key: TIKA-1836 > URL: https://issues.apache.org/jira/browse/TIKA-1836 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: Distributor ID: Ubuntu > Description: Ubuntu 12.04.5 LTS > Release: 12.04 > Codename: precise > java version "1.7.0_91" > OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1) > OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) >Reporter: Jorge Spinsanti > Attachments: test.doc > > > When we try to convert DOC -> TXT, I got the next stack trace: > {code} > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 15 more > Caused by: java.lang.UnsupportedOperationException: Non-extended character > Pascal strings are not supported right now. Please, contact POI developers > for update. > at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82) > at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61) > at > org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52) > at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 22 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Spinsanti updated TIKA-1836: -- Attachment: test.doc File used to find the issue. > Convertion DOC->TXT failed due to POI issue > --- > > Key: TIKA-1836 > URL: https://issues.apache.org/jira/browse/TIKA-1836 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: Distributor ID: Ubuntu > Description: Ubuntu 12.04.5 LTS > Release: 12.04 > Codename: precise > java version "1.7.0_91" > OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1) > OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) >Reporter: Jorge Spinsanti > Attachments: test.doc > > > When we try to convert DOC -> TXT, I got the next stack trace: > {code} > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 15 more > Caused by: java.lang.UnsupportedOperationException: Non-extended character > Pascal strings are not supported right now. Please, contact POI developers > for update. > at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82) > at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61) > at > org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52) > at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 
22 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106908#comment-15106908 ] Ken Krugler commented on TIKA-1836: --- This seems to be an issue for POI, as per the message in the stack trace. Is there something you'd want Tika to do here? > Convertion DOC->TXT failed due to POI issue > --- > > Key: TIKA-1836 > URL: https://issues.apache.org/jira/browse/TIKA-1836 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: Distributor ID: Ubuntu > Description: Ubuntu 12.04.5 LTS > Release: 12.04 > Codename: precise > java version "1.7.0_91" > OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1) > OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) >Reporter: Jorge Spinsanti > Attachments: test.doc > > > When we try to convert DOC -> TXT, I got the next stack trace: > {code} > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 15 more > Caused by: java.lang.UnsupportedOperationException: Non-extended character > Pascal strings are not supported right now. Please, contact POI developers > for update. 
> at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82) > at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61) > at > org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52) > at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 22 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106919#comment-15106919 ] Jorge Spinsanti commented on TIKA-1836: --- POI is a dependency of Tika. I think Tika could evaluate migrating to a newer version of POI. Or perhaps Tika could handle this issue with an alternative approach. > Convertion DOC->TXT failed due to POI issue > --- > > Key: TIKA-1836 > URL: https://issues.apache.org/jira/browse/TIKA-1836 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: Distributor ID: Ubuntu > Description: Ubuntu 12.04.5 LTS > Release: 12.04 > Codename: precise > java version "1.7.0_91" > OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1) > OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) >Reporter: Jorge Spinsanti > Attachments: test.doc > > > When we try to convert DOC -> TXT, I got the next stack trace: > {code} > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 15 more > Caused by: java.lang.UnsupportedOperationException: Non-extended character > Pascal strings are not supported right now. Please, contact POI developers > for update. 
> at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82) > at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61) > at > org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52) > at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53) > at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361) > at > org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 22 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)