[jira] [Assigned] (TIKA-719) Concurrent usage of HtmlParser causes infinite loop in HashMap

2011-09-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-719: Assignee: Ken Krugler Concurrent usage of HtmlParser causes infinite loop in HashMap

[jira] [Created] (TIKA-720) EBCDIC encoding not detected

2011-09-19 Thread Michael McCandless (JIRA)
EBCDIC encoding not detected Key: TIKA-720 URL: https://issues.apache.org/jira/browse/TIKA-720 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless

[jira] [Updated] (TIKA-720) EBCDIC encoding not detected

2011-09-19 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-720: Attachment: English_EBCDIC.txt EBCDIC encoding not detected

[jira] [Created] (TIKA-721) UTF16-LE not detected

2011-09-19 Thread Michael McCandless (JIRA)
UTF16-LE not detected Key: TIKA-721 URL: https://issues.apache.org/jira/browse/TIKA-721 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor

[jira] [Updated] (TIKA-721) UTF16-LE not detected

2011-09-19 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-721: Attachment: Chinese_Simplified_utf16.txt UTF16-LE not detected

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107959#comment-13107959 ] Nick Burch commented on TIKA-705: Initial workaround committed in r1172690. The proper fix

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-09-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107969#comment-13107969 ] Nick Burch commented on TIKA-721: In CharsetRecog_Unicode on line 69 (inside

[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

2011-09-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107977#comment-13107977 ] Nick Burch commented on TIKA-720: A few IBM-specific encodings are supported already in

[jira] [Updated] (TIKA-722) Arabic PDF doesn't extract correctly

2011-09-19 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-722: Attachment: metadata.png I checked this file: That's exactly the type of file I am talking about, here

[jira] [Updated] (TIKA-722) Arabic PDF doesn't extract correctly

2011-09-19 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated TIKA-722: Attachment: JUFO96.PDF Here is a non-Persian example (which is actually a very, very early writeup from

[jira] [Updated] (TIKA-724) PDF text sometimes has extra space between letters

2011-09-19 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-724: Attachment: extraSpaces.pdf PDF text sometimes has extra space between letters

[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

2011-09-19 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108034#comment-13108034 ] Michael McCandless commented on TIKA-720: Thanks Nick! That actually sounds