[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676109#comment-16676109 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I did a little experimentati

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675880#comment-16675880 ] Hans Brende commented on TIKA-2771: --- [~wave] Yep, just ran the following {code:java} Int

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread David Fisher (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675841#comment-16675841 ] David Fisher commented on TIKA-2771: I looked at the EBCDIC500 codepage online and was

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675839#comment-16675839 ] Tim Allison commented on TIKA-2771: --- I was thinking something similar... > enableInputF

[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675828#comment-16675828 ] Hans Brende edited comment on TIKA-2771 at 11/5/18 10:44 PM: -

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675828#comment-16675828 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] Ah, you're correct as regard

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675740#comment-16675740 ] Tim Allison commented on TIKA-2771: --- Got it. Thank you. bq. which calls: match(det, ng

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675708#comment-16675708 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I'm not sure which all of th

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675527#comment-16675527 ] Tim Allison commented on TIKA-2771: --- I'm happy enough adding this check into EBCDIC500.

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675520#comment-16675520 ] Tim Allison commented on TIKA-2771: --- When I add a {{tagsWereStripped}}, and have the EBC

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675511#comment-16675511 ] Tim Allison commented on TIKA-2771: --- Let me try again. I _think_ I've re-engaged my bra

[jira] [Created] (TIKA-2772) Problem if cell contains quotation marks (")

2018-11-05 Thread ionut hodor (JIRA)
ionut hodor created TIKA-2772: - Summary: Problem if cell contains quotation marks (") Key: TIKA-2772 URL: https://issues.apache.org/jira/browse/TIKA-2772 Project: Tika Issue Type: Bug C

[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675342#comment-16675342 ] Tim Allison commented on TIKA-2750: --- To my query above about jacoco, see the responses b

[jira] [Commented] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675322#comment-16675322 ] Tim Allison commented on TIKA-2750: --- I just added charset and lang by tld in last month'

[jira] [Updated] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2750: -- Attachment: CC-MAIN-2018-39-charset_lang_by_tld.zip > Update regression corpus > ---

[jira] [Commented] (TIKA-2765) Regression extracting text from corrupted docx files

2018-11-05 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675281#comment-16675281 ] Luis Filipe Nassif commented on TIKA-2765: -- POI-62886 created. Thanks [~talli...@

[jira] [Commented] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674833#comment-16674833 ] ionut hodor commented on TIKA-2767: --- Hi [~davemeikle], thank you for answering me, I atta

[jira] [Updated] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ionut hodor updated TIKA-2767: -- Attachment: exampleXLS.xls exampleXLSX.xlsx > Problem with import xlsx with null cells >

[jira] [Commented] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674826#comment-16674826 ] ionut hodor commented on TIKA-2767: --- Hi [~davemeikle] I have 2 examples for you > Probl

[jira] [Issue Comment Deleted] (TIKA-2767) Problem with import xlsx with null cells

2018-11-05 Thread ionut hodor (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ionut hodor updated TIKA-2767: -- Comment: was deleted (was: Hi [~davemeikle] I have 2 example for you) > Problem with import xlsx with