[
https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless reassigned TIKA-711:
---
Assignee: Michael McCandless
> Word parser doesn't extract optional hyphen correctly
[
https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated TIKA-711:
Attachment: TIKA-711.patch
OK, after digging I found out that in fact POI's AbstractWordConv
[
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless reassigned TIKA-721:
---
Assignee: Michael McCandless
> UTF16-LE not detected
[
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119018#comment-13119018
]
Nick Burch commented on TIKA-721:
-
I'd suggest we check for invalid UTF-16 sequences (see
h
[
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated TIKA-721:
Attachment: TIKA-721.patch
Attached patch, using three simple heuristics:
First, I compute t
[
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119035#comment-13119035
]
Michael McCandless commented on TIKA-721:
-
bq. I'd suggest we check for invalid UTF-
[
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119038#comment-13119038
]
Robert Muir commented on TIKA-721:
--
{quote}
Finally, for the valid code points, I count how
[
https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044
]
Michael McCandless commented on TIKA-721:
-
{quote}
bq. Finally, for the valid code
[
https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119060#comment-13119060
]
Robert Muir commented on TIKA-713:
--
I created PDFBOX-1127 for this with some screenshots an