[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216679#comment-15216679
 ] 

Tim Allison edited comment on TIKA-1836 at 3/29/16 7:16 PM:


No problem. 1.12 was cut in January before the upgrade in POI.  This is fixed 
in trunk/1.13...I just confirmed.  I didn't add a test because the test file at 
20kb felt too big for the rarity (presumed) of this issue.  I can add a test in 
Tika if we want it.


was (Author: talli...@mitre.org):
No problem. 1.12 was cut in January before we the upgrade in POI.  This is 
fixed in trunk/1.13...I just confirmed.  I didn't add a test because the test 
file at 20kb felt too big for the rarity (presumed) of this issue.  I can add a 
test in Tika if we want it.

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment:
>   Distributor ID:  Ubuntu
>   Description:  Ubuntu 12.04.5 LTS
>   Release:  12.04
>   Codename: precise
>   java version "1.7.0_91"
>   OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
>   OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
> Reporter: Jorge Spinsanti
> Fix For: 1.13
>
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}
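
The trace in the issue description is a wrapped exception: the TikaException reported by the parser carries the POI UnsupportedOperationException as its cause. As a minimal sketch (the class and method names here are hypothetical and not part of Tika or POI), the cause chain can be walked with the standard `Throwable.getCause()` to find the innermost failure:

```java
// Hypothetical helper; not part of Tika or POI.
class RootCauseSketch {

    // Follow getCause() until the innermost throwable is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the two exceptions in the trace; plain JDK types are used
        // so the sketch runs without Tika or POI on the classpath.
        Exception inner = new UnsupportedOperationException(
                "Non-extended character Pascal strings are not supported right now.");
        Exception outer = new RuntimeException("Unexpected RuntimeException from parser", inner);

        System.out.println(rootCause(outer).getClass().getName());
    }
}
```

Inspecting the root cause this way makes it easier to tell a failure inside a dependency from one in the calling code.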



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216679#comment-15216679
 ] 

Tim Allison edited comment on TIKA-1836 at 3/29/16 7:15 PM:


No problem. 1.12 was cut in January before the upgrade in POI.  This is 
fixed in trunk/1.13...I just confirmed.  I didn't add a test because the test 
file at 20kb felt too big for the rarity (presumed) of this issue.  I can add a 
test in Tika if we want it.


was (Author: talli...@mitre.org):
No problem. 1.12 was cut in January.  This is fixed in trunk/1.13...I just 
confirmed.  I didn't add a test because the test file at 20kb felt too big for 
the rarity (presumed) of this issue.  I can add a test in Tika if we want it.






[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107212#comment-15107212
 ] 

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:08 PM:


The POI issue was reported on 2014-08-22. Perhaps if TIKA (another Apache 
project) needs the fix, the TIKA team can push to raise its priority/importance.


was (Author: giorgy):
POI issue was report in 2014-08-22. Perhaps if TIKA needs the fix, TIKA team 
can push to increase the priority/importance.






[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106919#comment-15106919
 ] 

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:04 PM:


POI is a dependency of TIKA. I think TIKA could evaluate migrating to a newer 
version of POI. Or perhaps TIKA could handle this issue with an alternative 
approach.


was (Author: giorgy):
POI is a dependency of TIKA. I think TIKA can be evaluate to migrate the use of 
POI to new version. Or perhaps, TIKA can be manage this issue trying an 
alternative idea.






[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107067#comment-15107067
 ] 

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:57 PM:


I concur with Ken: if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction of the entire document.  Would you mind opening an issue in POI's 
Bugzilla and adding a link to this issue?  I'll see what I can do; it probably 
won't be until early next week, and then we'll have to wait for the next version 
of POI before we'll see different behavior in Tika.

-Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.-

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039
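
The "catch and log" idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical stand-ins, not POI's actual code, and the real fix in POI may look quite different. The point is that a failure while reading an optional table is logged and replaced with an empty result, so the rest of the document can still be extracted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a document reader; these are NOT POI's real classes.
class LenientTableSketch {
    private static final Logger LOG =
            Logger.getLogger(LenientTableSketch.class.getName());

    // Strict reader: fails on the unsupported string encoding, as in the trace for this issue.
    static List<String> readSavedByStrict(boolean extendedChars) {
        if (!extendedChars) {
            throw new UnsupportedOperationException(
                    "Non-extended character Pascal strings are not supported right now.");
        }
        List<String> entries = new ArrayList<>();
        entries.add("example author");
        return entries;
    }

    // Lenient reader: logs the failure and returns an empty table instead of propagating it.
    static List<String> readSavedByLenient(boolean extendedChars) {
        try {
            return readSavedByStrict(extendedChars);
        } catch (UnsupportedOperationException e) {
            LOG.log(Level.WARNING, "Skipping unreadable saved-by table", e);
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(readSavedByLenient(true).size());   // table read normally
        System.out.println(readSavedByLenient(false).size());  // failure logged, empty table
    }
}
```

The trade-off is that the caller silently loses the table contents, which is why the failure is logged at WARNING level rather than swallowed.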


was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and add a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039






[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107067#comment-15107067
 ] 

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:50 PM:


I concur with Ken: if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction of the entire document.  Would you mind opening an issue in POI's 
Bugzilla and adding a link to this issue?  I'll see what I can do; it probably 
won't be until early next week, and then we'll have to wait for the next version 
of POI before we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039


was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and add a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E



