[jira] [Created] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-19 Thread Markus Jelsma (JIRA)
Markus Jelsma created TIKA-1835:
---

 Summary: LinkContentHandler skips iframe and rel tags
 Key: TIKA-1835
 URL: https://issues.apache.org/jira/browse/TIKA-1835
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 1.11
Reporter: Markus Jelsma
 Fix For: 1.12


As simple as it gets, link and iframe tags were never implemented in 
LinkContentHandler. NUTCH-1233 kind of requires it.
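
To make the request concrete, here is a minimal stand-alone sketch of the behaviour being asked for: a SAX handler that records targets from a, link and iframe elements. It uses only the JDK's SAX classes; the class name LinkCollectorSketch and the "element -> target" output format are made up for illustration, and this is not how Tika's LinkContentHandler is actually written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone collector; NOT Tika's LinkContentHandler.
final class LinkCollectorSketch extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = qName.toLowerCase(Locale.ROOT);
        String target = null;
        if (name.equals("a") || name.equals("link")) {
            target = atts.getValue("href");   // anchors and <link rel=...> carry href
        } else if (name.equals("iframe")) {
            target = atts.getValue("src");    // iframes carry src
        }
        if (target != null) {
            links.add(name + " -> " + target);
        }
    }

    /** Parses well-formed XHTML and returns "element -> target" entries in document order. */
    static List<String> collect(String xhtml) {
        LinkCollectorSketch handler = new LinkCollectorSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.links;
    }
}
```

The sketch assumes well-formed XHTML input; real-world HTML would need a tolerant parser in front of the handler.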



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1823) Support detecting DWF format

2016-01-19 Thread Luca Moretti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Moretti updated TIKA-1823:
---
Attachment: blocks_and_tables.dwf

I found this file on the Autodesk website that could be a suitably licensed 
sample.
The file can be found at the following location:
https://knowledge.autodesk.com/support/autocad/downloads/caas/downloads/content/autocad-sample-files.html

> Support detecting DWF format
> 
>
> Key: TIKA-1823
> URL: https://issues.apache.org/jira/browse/TIKA-1823
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, mime
>Reporter: Luca Moretti
>Priority: Minor
>  Labels: detection, dwf, mime
> Attachments: blocks_and_tables.dwf
>
>
> Tika currently detects dwf files as application/octet-stream.
> To make Tika's mime magic detector recognize dwf files correctly, this code 
> fragment should be added to the _tika-mimetypes.xml_ registry:
> {code:xml}
> 
>   dwf
>   <_comment>Design Web Format</_comment>
>   
>   
>   
>   
>   
>   
>   
>   
> 
> {code}
> \\
> In the current version (DWF 6.0), a dwf file is a ZIP-compressed container for 
> vector-based CAD drawings. It is basically a ZIP archive with the _(DWF 
> V06.00)_ signature added before the regular ZIP magic number. For this 
> reason, the match value to detect dwf files should be: {{(DWF V06.00)PK}}.
> In previous versions, the dwf data transport isn't a ZIP file format, so 
> the magic number is only the _(DWF V00.55)_ signature in the file header.
> To make Tika detect dwf files of these versions too, I propose the match value 
> in the code above.
> Thanks,
> Luca
> \\
> P.S.: The DWF format specification is included in the DWF Toolkit. The DWF 
> Toolkit is available for free at [http://www.autodesk.com/dwftoolkit]
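
The magic-number check described in the issue can be sketched in plain Java as follows. This is only an illustration of the byte layout, not Tika's detector: the class name DwfSniffer is hypothetical, and accepting any two-digit "dd.dd" version between "(DWF V" and ")" is an assumption that goes beyond the two signatures (V06.00 and V00.55) named above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika; it mirrors the magic values named in the issue.
final class DwfSniffer {
    private static final int SIGNATURE_LENGTH = 12; // "(DWF V06.00)" is 12 ASCII bytes

    /** True if the header starts with "(DWF Vdd.dd)" (the generic dd.dd form is an assumption). */
    static boolean looksLikeDwf(byte[] header) {
        if (header.length < SIGNATURE_LENGTH) {
            return false;
        }
        String sig = new String(header, 0, SIGNATURE_LENGTH, StandardCharsets.US_ASCII);
        return sig.matches("\\(DWF V\\d{2}\\.\\d{2}\\)");
    }

    /** True if the signature is immediately followed by the ZIP magic "PK", as described for DWF 6.0. */
    static boolean isZipBasedDwf(byte[] header) {
        return looksLikeDwf(header)
                && header.length >= SIGNATURE_LENGTH + 2
                && header[SIGNATURE_LENGTH] == 'P'
                && header[SIGNATURE_LENGTH + 1] == 'K';
    }
}
```

Only the first 14 bytes of a file are needed for either check, so a caller could read a small prefix rather than the whole stream.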





[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107368#comment-15107368
 ] 

Tim Allison commented on TIKA-1799:
---

[~kiwiwings], looks like we have to specify packages after *.office, word, 
powerpoint, etc.  The bundle build works in Tika with just powerpoint and word 
set to optional, should we add visio, excel, etc?

> Upgrade to POI 3.14-Beta1 when available
> 
>
> Key: TIKA-1799
> URL: https://issues.apache.org/jira/browse/TIKA-1799
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 349008.ppt, 349008.ppt.json
>
>
> Should be out in the next week or two.





[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available

2016-01-19 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107395#comment-15107395
 ] 

Bob Paulin commented on TIKA-1799:
--

Actually I'd be careful using the wildcard here because I think 
poi-ooxml-schemas provides the visio, office and excel packages.  So I don't 
think they should be optional.



[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available

2016-01-19 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107392#comment-15107392
 ] 

Bob Paulin commented on TIKA-1799:
--

So it's actually a pretty interesting question.  If you wanted to make all the 
subpackages of com.microsoft.schemas.office optional you should be able to do:

{code}
com.microsoft.schemas.office.*;resolution:=optional,
{code}

All of these settings are based on BND http://www.aqute.biz/Bnd/Bnd .

Not sure we want to wildcard in this case but I believe that would also work.  
All things equal I prefer explicitly listing optional packages.



[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107216#comment-15107216
 ] 

Tim Allison commented on TIKA-1836:
---

Y, done.  I asked POI colleagues if they minded if we logged this instead of 
throwing an exception.  If there are no dissenting opinions, I'll make the 
change in POI early next week.
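
As a rough illustration of the log-instead-of-throw idea, here is a sketch in plain Java. Everything in it is hypothetical: LenientTableReaderSketch, readStrictly and readLeniently are invented names, and the byte handling is a stand-in rather than POI's actual Sttb parsing. It only shows the control flow, where an unsupported field is logged and skipped so the rest of the document can still be extracted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of "catch and log"; not POI code.
final class LenientTableReaderSketch {
    private static final Logger LOG =
            Logger.getLogger(LenientTableReaderSketch.class.getName());

    /** Stand-in for a parser step that rejects one string encoding outright. */
    static List<String> readStrictly(byte[] data, boolean extended) {
        if (!extended) {
            throw new UnsupportedOperationException(
                    "Non-extended character Pascal strings are not supported");
        }
        List<String> out = new ArrayList<>();
        out.add(new String(data, StandardCharsets.UTF_16LE));
        return out;
    }

    /** Logs the problem and returns an empty table instead of aborting the whole extraction. */
    static List<String> readLeniently(byte[] data, boolean extended) {
        try {
            return readStrictly(data, extended);
        } catch (UnsupportedOperationException e) {
            LOG.log(Level.WARNING, "Skipping unsupported string table", e);
            return new ArrayList<>();
        }
    }
}
```

The trade-off is that the caller silently loses the skipped table, so the warning needs to be visible somewhere users will look.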

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:  12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>Reporter: Jorge Spinsanti
> Attachments: test.doc
>
>
> When I try to convert DOC -> TXT, I get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character 
> Pascal strings are not supported right now. Please, contact POI developers 
> for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at 
> org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}





[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107067#comment-15107067
 ] 

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:57 PM:


I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and adding a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

-Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.-

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039


was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and adding a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039



[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107080#comment-15107080
 ] 

Tim Allison commented on TIKA-1836:
---

Not already fixed in POI:  this is still open: 
https://bz.apache.org/bugzilla/show_bug.cgi?id=56880



[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107067#comment-15107067
 ] 

Tim Allison edited comment on TIKA-1836 at 1/19/16 5:50 PM:


I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and adding a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

Or, has this already been fixed in POI [1]?  If so, we'll be updating soon 
(TIKA-1799) once the transfer to git has finished.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E

[1] http://comments.gmane.org/gmane.comp.jakarta.poi.devel/27039


was (Author: talli...@mitre.org):
I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and adding a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E



[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107077#comment-15107077
 ] 

Tim Allison commented on TIKA-1799:
---

[~bobpaulin], I hate to bother you with this, but do you have any 
recommendations for the bundling issues we're seeing?  Andi and Dominik have 
both taken a look [0].  Working integration (well non-working integration :) ) 
is here: https://github.com/tballison/tika/tree/poi-3_14_beta1

[0] 
http://mail-archives.apache.org/mod_mbox/poi-dev/201601.mbox/%3cby2pr09mb112b38091d6fef30cc59311c7...@by2pr09mb112.namprd09.prod.outlook.com%3e



[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107067#comment-15107067
 ] 

Tim Allison commented on TIKA-1836:
---

I concur with Ken, if I understand this correctly, we can't do anything at the 
Tika level to prevent this from happening.  I also agree with Phil [0], though, 
that we should probably catch and log this in POI rather than preventing the 
extraction from the entire document.  Mind opening an issue in POI's bugzilla 
and adding a link to this issue?  I'll see what I can do...prob won't be until 
early next week, and then we'll have to wait for the next version of POI before 
we'll see different behavior in Tika.

[0] 
https://mail-archives.apache.org/mod_mbox/poi-user/201303.mbox/%3CCABWW=XW_YBOG-FtJ2Jqu+tqXU85Vk_8nUP=L=vq_mn4ng2c...@mail.gmail.com%3E



[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107221#comment-15107221
 ] 

Tim Allison commented on TIKA-1836:
---

The better solution of course would be to add proper parsing for these types of 
currently unsupported fields.  Any interest in submitting a patch over on 
POI-56880? :)



[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107212#comment-15107212
 ] 

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:08 PM:


The POI issue was reported on 2014-08-22. Perhaps if TIKA (another Apache 
project) needs the fix, the TIKA team can push to increase its priority/importance.


was (Author: giorgy):
The POI issue was reported on 2014-08-22. Perhaps if TIKA needs the fix, the 
TIKA team can push to increase its priority/importance.



[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107212#comment-15107212
 ] 

Jorge Spinsanti commented on TIKA-1836:
---

The POI issue was reported on 2014-08-22. Perhaps if TIKA needs the fix, the 
TIKA team can push to increase its priority/importance.



[jira] [Comment Edited] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106919#comment-15106919
 ] 

Jorge Spinsanti edited comment on TIKA-1836 at 1/19/16 7:04 PM:


POI is a dependency of TIKA. I think TIKA could evaluate migrating to a newer 
POI version. Or perhaps TIKA can manage this issue by trying an 
alternative approach.


was (Author: giorgy):
POI is a dependency of TIKA. I think TIKA can be evaluate to migrate the use of 
POI to new version. Or perhaps, TIKA can be manage this issue trying an 
alternative idea.



[jira] [Commented] (TIKA-1799) Upgrade to POI 3.14-Beta1 when available

2016-01-19 Thread Andreas Beeker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107690#comment-15107690
 ] 

Andreas Beeker commented on TIKA-1799:
--

I have no idea how osgi bundling works, but adding the sub-packages (if the 
base package approach doesn't work) was the recommendation in my original mail 
[1]

I don't know what I should recommend here - why were originally only powerpoint 
and word optional?
What is the effect of providing packages via poi-ooxml-schemas and marking them 
as optional?

[1] 
http://mail-archives.apache.org/mod_mbox/poi-dev/201601.mbox/%3c568d4ee1.7030...@apache.org%3E

> Upgrade to POI 3.14-Beta1 when available
> 
>
> Key: TIKA-1799
> URL: https://issues.apache.org/jira/browse/TIKA-1799
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 349008.ppt, 349008.ppt.json
>
>
> Should be out in the next week or two.





[jira] [Created] (TIKA-1837) HtmlEncodingDetector wrongly detects charset from commented meta

2016-01-19 Thread Pascal Essiembre (JIRA)
Pascal Essiembre created TIKA-1837:
--

 Summary: HtmlEncodingDetector wrongly detects charset from 
commented meta
 Key: TIKA-1837
 URL: https://issues.apache.org/jira/browse/TIKA-1837
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.11
 Environment: Any.
Reporter: Pascal Essiembre
Priority: Minor


The org.apache.tika.parser.html.HtmlEncodingDetector class grabs the first 
meta tag that has a charset in it matching the pattern defined in 
HTTP_META_PATTERN. The problem arises when there are multiple such meta tags 
but the first ones are commented out.

In my view, the detector should ignore commented-out markup for this detection.

Real example encountered in an HTML page:

{code:xml}
   
   
{code}

The detector currently detects {{ISO-8859-1}} while it should detect {{utf-8}}.

*Fix:*

Rather than modifying the meta-detection regex, I recommend first stripping 
comments, taking into account that the substring read from the input stream may 
not contain the closing characters {{-->}}.  This has been tested to work:

{code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();

// START FIX:
head = head.replaceAll("<!--.*?(-->|$)", "");
// END FIX

Matcher equiv = HTTP_META_PATTERN.matcher(head);
{code}
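
The idea can be tried outside Tika with a small, self-contained sketch. The names {{CommentedMetaDemo}}, {{META_CHARSET}} and {{COMMENT}} are hypothetical, {{META_CHARSET}} is only a simplified stand-in for the real HTTP_META_PATTERN (which is not reproduced here), and the sample markup is made up for illustration. The {{(?s)}} flag is an addition over the one-line fix above: it is assumed to be wanted so that comments spanning several lines are stripped as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedMetaDemo {
    // Hypothetical, simplified stand-in for Tika's HTTP_META_PATTERN.
    static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s[^>]*charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)");

    // (?s) lets '.' cross line breaks, so multi-line comments are removed too.
    // The "|$" alternative drops a comment whose closing "-->" was cut off
    // because only the first bytes of the stream were read.
    static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?(-->|$)");

    static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    // Returns the first charset found outside comments, or null if none.
    static String detectCharset(String head) {
        Matcher m = META_CHARSET.matcher(stripComments(head));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head = "<head>\n"
                + "<!-- <meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"> -->\n"
                + "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\">\n"
                + "</head>";
        System.out.println(detectCharset(head)); // prints utf-8
    }
}
```

Stripping first keeps the meta regex untouched, at the cost of one extra pass over the short head buffer.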





[jira] [Commented] (TIKA-1833) NoClassDefFoundError for POIXMLTypeLoader

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106723#comment-15106723
 ] 

Tim Allison commented on TIKA-1833:
---

Ha.  Ok.  Great to hear.  It doesn't surprise me that there might yet be 
surprises, but this one was surprising. :)  Let us know when you find anything 
else that is curious, and happy extraction!

> NoClassDefFoundError for POIXMLTypeLoader
> -
>
> Key: TIKA-1833
> URL: https://issues.apache.org/jira/browse/TIKA-1833
> Project: Tika
>  Issue Type: Bug
>Reporter: Mohammed Manna
>
> I downloaded tika-app-1.11.jar, which has all the necessary dependencies 
> (I opened it with 7zip and checked the classes). I tried to parse .doc and 
> .docx files for my project, but it throws an Error (not an Exception). The 
> stack trace is as follows:
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
> at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown
>  Source)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
> at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
> at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at xxx.xxx.xxx.xxx.xAttachmentWithTika(xxxService.java:792)
> I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a 
> known issue? Could someone please respond to me?





[jira] [Updated] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-19 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated TIKA-1835:

Flags: Patch,Important  (was: Important)

> LinkContentHandler skips iframe and rel tags
> 
>
> Key: TIKA-1835
> URL: https://issues.apache.org/jira/browse/TIKA-1835
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.11
>Reporter: Markus Jelsma
> Fix For: 1.12
>
>
> As simple as it gets, link and iframe tags were never implemented in 
> LinkContentHandler. NUTCH-1233 kind of requires it.





[jira] [Updated] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-19 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated TIKA-1835:

Attachment: TIKA-1835.patch

Patch for trunk. Adds link extraction for iframe and link elements. 
Tests included.
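
For readers without the patch at hand, the general idea can be sketched with a plain SAX handler from the JDK. This is not the code in TIKA-1835.patch and not Tika's LinkContentHandler API: the class name {{LinkCollector}}, the {{collect}} helper and the "element target" output format are all hypothetical, and the input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LinkCollector extends DefaultHandler {
    // Each entry is "<element name> <target>", e.g. "iframe frame.html".
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // qName is used because the parser below is not namespace-aware.
        String target = null;
        if ("a".equals(qName) || "link".equals(qName)) {
            target = atts.getValue("href");   // <a href> and <link href>
        } else if ("iframe".equals(qName) || "img".equals(qName)) {
            target = atts.getValue("src");    // <iframe src> and <img src>
        }
        if (target != null) {
            links.add(qName + " " + target);
        }
    }

    // Parses well-formed XHTML and returns the collected links in order.
    static List<String> collect(String xhtml) {
        LinkCollector handler = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.links;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"style.css\"/></head>"
                + "<body><iframe src=\"frame.html\"/>"
                + "<a href=\"page.html\">x</a></body></html>";
        System.out.println(collect(page));
    }
}
```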

> LinkContentHandler skips iframe and rel tags
> 
>
> Key: TIKA-1835
> URL: https://issues.apache.org/jira/browse/TIKA-1835
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.11
>Reporter: Markus Jelsma
> Fix For: 1.12
>
> Attachments: TIKA-1835.patch
>
>
> As simple as it gets, link and iframe tags were never implemented in 
> LinkContentHandler. NUTCH-1233 kind of requires it.





[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-01-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106752#comment-15106752
 ] 

Tim Allison commented on TIKA-1824:
---

Thank you, [~bobpaulin]!  Again, this is fantastic.  I should have a chance to 
take a look later today.  [~chrismattmann], [~gagravarr], [~kkrugler], 
[~lewismc], [~rgauss] or others, any feedback on this massive refactoring?

> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create an initial breakdown of parser modules.





[jira] [Created] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)
Jorge Spinsanti created TIKA-1836:
-

 Summary: Convertion DOC->TXT failed due to POI issue
 Key: TIKA-1836
 URL: https://issues.apache.org/jira/browse/TIKA-1836
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.11
 Environment: Distributor ID:   Ubuntu
Description:Ubuntu 12.04.5 LTS
Release:12.04
Codename:   precise

java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)


Reporter: Jorge Spinsanti


When we try to convert DOC -> TXT, we get the following stack trace:
{code}
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 15 more
Caused by: java.lang.UnsupportedOperationException: Non-extended character 
Pascal strings are not supported right now. Please, contact POI developers for 
update.
at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
at 
org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
at 
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 22 more
{code}





[jira] [Updated] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Spinsanti updated TIKA-1836:
--
Component/s: parser

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:  12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>Reporter: Jorge Spinsanti
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character 
> Pascal strings are not supported right now. Please, contact POI developers 
> for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at 
> org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}





[jira] [Updated] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Spinsanti updated TIKA-1836:
--
Attachment: test.doc

File used to find the issue.

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:  12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>Reporter: Jorge Spinsanti
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character 
> Pascal strings are not supported right now. Please, contact POI developers 
> for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at 
> org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}





[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106908#comment-15106908
 ] 

Ken Krugler commented on TIKA-1836:
---

This seems to be an issue for POI, as per the message in the stack trace. Is 
there something you'd want Tika to do here?
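
One way a caller could confirm where such a failure originates is to walk the cause chain and look at the top frame of the root cause. The sketch below is only an illustration: {{RootCauseDemo}}, {{rootCause}} and {{thrownFrom}} are hypothetical helpers, and a plain RuntimeException with a hand-built stack frame stands in for the real TikaException so that the example has no Tika or POI dependency.

```java
class RootCauseDemo {
    // Follows getCause() until the innermost throwable is reached.
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null && cur.getCause() != cur) {
            cur = cur.getCause();
        }
        return cur;
    }

    // True if the top stack frame of t belongs to the given package prefix.
    static boolean thrownFrom(Throwable t, String packagePrefix) {
        StackTraceElement[] frames = t.getStackTrace();
        return frames.length > 0
                && frames[0].getClassName().startsWith(packagePrefix);
    }

    public static void main(String[] args) {
        // Synthetic stand-in for the reported failure (frame built by hand).
        UnsupportedOperationException inner = new UnsupportedOperationException(
                "Non-extended character Pascal strings are not supported");
        inner.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("org.apache.poi.hwpf.model.Sttb",
                    "fillFields", "Sttb.java", 82)
        });
        RuntimeException outer =
                new RuntimeException("Unexpected RuntimeException", inner);

        Throwable root = rootCause(outer);
        System.out.println(root.getClass().getName());
        System.out.println(thrownFrom(root, "org.apache.poi."));
    }
}
```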

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:  12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>Reporter: Jorge Spinsanti
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character 
> Pascal strings are not supported right now. Please, contact POI developers 
> for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at 
> org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}





[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Jorge Spinsanti (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106919#comment-15106919
 ] 

Jorge Spinsanti commented on TIKA-1836:
---

POI is a dependency of Tika. I think Tika could evaluate migrating to a newer 
version of POI. Or perhaps Tika could handle this issue with an alternative 
approach.

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:  12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>Reporter: Jorge Spinsanti
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character 
> Pascal strings are not supported right now. Please, contact POI developers 
> for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at 
> org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}


