[jira] [Assigned] (TIKA-711) Word parser doesn't extract optional hyphen correctly

2011-10-02 Thread Michael McCandless (Assigned) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-711: Assignee: Michael McCandless > Word parser doesn't extract optional hyphen correctl

[jira] [Updated] (TIKA-711) Word parser doesn't extract optional hyphen correctly

2011-10-02 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-711: Attachment: TIKA-711.patch OK, after digging I found out that in fact POI's AbstractWordConv

[jira] [Assigned] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Assigned) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-721: Assignee: Michael McCandless > UTF16-LE not detected

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119018#comment-13119018 ] Nick Burch commented on TIKA-721: I'd suggest we check for invalid UTF-16 sequences (see h

[jira] [Updated] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-721: Attachment: TIKA-721.patch Attached patch, using three simple heuristics: First, I compute t

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119035#comment-13119035 ] Michael McCandless commented on TIKA-721: bq. I'd suggest we check for invalid UTF-

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Robert Muir (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119038#comment-13119038 ] Robert Muir commented on TIKA-721: {quote} Finally, for the valid code points, I count how

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044 ] Michael McCandless commented on TIKA-721: {quote} bq. Finally, for the valid code

[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

2011-10-02 Thread Robert Muir (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119060#comment-13119060 ] Robert Muir commented on TIKA-713: I created PDFBOX-1127 for this with some screenshots an