[ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425981#comment-13425981 ]
Ray Gauss II edited comment on TIKA-965 at 7/31/12 6:18 PM:
------------------------------------------------------------

That's the solution I was looking into, and I wanted to duplicate as little code as possible.

Let me preface the rest of this by saying I don't know a whole lot about character encoding detection or bundling.

Here's the outline of what seems to be a working solution:
# Move {{org.apache.tika.parser.txt.Charset*}} to tika-core
# Add a list of valid charsets (only UTF-8 at the moment) and a minimum confidence level (80 at the moment) to {{TextDetector}}
# If {{TextDetector}} comes up with {{isMostlyASCII=false}}, fire up a {{CharsetDetector}} and check the match against the valid charsets and minimum confidence above

The only problem I'm running into with this approach is that, to maintain backwards compatibility, {{Charset*}} must stay in the same {{org.apache.tika.parser.txt}} package, and tika-bundle throws a fit about that, coincidentally related to TIKA-966. For testing purposes I turned off export of {{org.apache.tika.parser.txt}} in tika-bundle, but I'm sure that's not the solution we want.

What do you all think of this approach, and if it is reasonable, what's the best way to handle the {{org.apache.tika.parser.txt}} conflict in tika-bundle?

> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8
> characters the TextDetector and TextStatistics classes fail to detect it as
> text.
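
For illustration only, the fallback flow outlined in the comment (ASCII heuristic first, then a charset check against an allow-list and a minimum confidence) might look roughly like the sketch below. It is not Tika code: the class {{TextFallbackSketch}}, the {{Guess}} holder, the 90% ASCII threshold, and every method name are assumptions made up for this example. To stay free of dependencies it swaps in strict JDK UTF-8 decoding for {{CharsetDetector}}, reporting a confidence of 100 or 0; the only values taken from the comment are the UTF-8 allow-list and the minimum confidence of 80.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

final class TextFallbackSketch {

    /** Hypothetical stand-in for a charset match: a name plus a 0-100 confidence. */
    static final class Guess {
        final String charset;
        final int confidence;

        Guess(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    // Values from the comment: only UTF-8 is accepted, minimum confidence is 80.
    static final Set<String> VALID_CHARSETS = Collections.singleton("UTF-8");
    static final int MIN_CONFIDENCE = 80;

    /** Assumed heuristic: at least 90% of the bytes are printable ASCII or tab/LF/CR. */
    static boolean isMostlyAscii(byte[] data) {
        int ascii = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 0x20 && v < 0x7F) || v == '\t' || v == '\n' || v == '\r') {
                ascii++;
            }
        }
        return ascii * 10 >= data.length * 9;
    }

    /** Stand-in for a real detector: strict UTF-8 decoding, scored as 100 or 0. */
    static Guess guessCharset(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return new Guess("UTF-8", 100);
        } catch (CharacterCodingException e) {
            return new Guess("UTF-8", 0);
        }
    }

    /** Step 3 of the outline: only consult the charset guess when the ASCII check fails. */
    static boolean looksLikeText(byte[] data) {
        if (isMostlyAscii(data)) {
            return true;
        }
        Guess guess = guessCharset(data);
        return VALID_CHARSETS.contains(guess.charset)
                && guess.confidence >= MIN_CONFIDENCE;
    }
}
```

With this sketch, a mostly Cyrillic UTF-8 string fails the ASCII check but passes the fallback, while bytes that are not valid UTF-8 fail both. A real detector would return graded confidences, which is where the threshold of 80 would actually matter.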