[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2038: Attachment: comparisons_20160803b.xlsx Full results; fixed spurious extra rows in output > A more

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2038: Attachment: (was: comparisons_20160803.xlsx) > A more accurate facility for detecting Charset

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406408#comment-15406408 ] Tim Allison edited comment on TIKA-2038 at 8/3/16 6:51 PM: I wrote a markup

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2038: Attachment: comparisons_20160803.xlsx I wrote a markup stripper that ignores content in tags, comments,

[jira] [Comment Edited] (TIKA-721) UTF16-LE not detected

2016-08-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405785#comment-15405785 ] Tim Allison edited comment on TIKA-721 at 8/3/16 12:03 PM: While working on

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2016-08-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405785#comment-15405785 ] Tim Allison commented on TIKA-721: While working on TIKA-2038, I found that ICU4J is now correctly