[jira] [Comment Edited] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files

Ray Gauss II (JIRA) Tue, 31 Jul 2012 11:18:37 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425981#comment-13425981
 ]


Ray Gauss II edited comment on TIKA-965 at 7/31/12 6:18 PM:
------------------------------------------------------------

That's the solution I was looking into and I wanted to duplicate as little code 
as possible.

Let me preface the rest of this by saying I don't know a whole lot about 
character encoding detection or bundling.

Here's the outline of what seems to be a working solution:

# Move {{org.apache.tika.parser.txt.Charset*}} to tika-core
# Add a list of valid charsets (only UTF-8 at the moment) and a minimum 
confidence level (80 at the moment) to {{TextDetector}}
# If {{TextDetector}} comes up with {{isMostlyASCII=false}}, fire up a 
{{CharsetDetector}} and check its match against the valid charsets and minimum 
confidence above
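
The control flow of the last two steps could be sketched roughly as follows. 
This is not the actual patch and does not use the Tika API: the class name, 
the 90% ASCII threshold, and the {{Match}} holder are all hypothetical, and a 
strict UTF-8 decode stands in for {{CharsetDetector}}, reporting a confidence 
of 100 or 0 instead of a real statistical score.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of the proposed fallback; not Tika's TextDetector.
class TextFallbackSketch {
    // Values from the outline: only UTF-8 is accepted, minimum confidence 80.
    static final Set<String> VALID_CHARSETS = Collections.singleton("UTF-8");
    static final int MIN_CONFIDENCE = 80;

    // Hypothetical holder for a charset match: name plus confidence (0-100).
    static final class Match {
        final String name;
        final int confidence;
        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Assumption: "mostly ASCII" means at least 90% of bytes are below 0x80.
    static boolean isMostlyAscii(byte[] prefix) {
        int ascii = 0;
        for (byte b : prefix) {
            if ((b & 0x80) == 0) {
                ascii++;
            }
        }
        return prefix.length > 0 && ascii * 100 >= prefix.length * 90;
    }

    // Stand-in for a charset detector: a strict UTF-8 decode that yields
    // confidence 100 on success and 0 on any malformed sequence. Note that a
    // multi-byte sequence cut off at the end of the prefix also counts as
    // malformed here.
    static Match detect(byte[] prefix) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(prefix));
            return new Match("UTF-8", 100);
        } catch (CharacterCodingException e) {
            return new Match("UTF-8", 0);
        }
    }

    // Mostly-ASCII input is accepted directly; otherwise the charset match
    // must name a valid charset and reach the minimum confidence.
    static boolean looksLikeText(byte[] prefix) {
        if (prefix.length == 0) {
            return false;
        }
        if (isMostlyAscii(prefix)) {
            return true;
        }
        Match match = detect(prefix);
        return VALID_CHARSETS.contains(match.name)
                && match.confidence >= MIN_CONFIDENCE;
    }

    public static void main(String[] args) {
        // Japanese text written with Unicode escapes: every byte is non-ASCII.
        byte[] japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeText(japanese));
    }
}
```

With a real {{CharsetDetector}} the confidence would be a graded score, so the 
minimum of 80 would actually matter; in this stand-in it only separates valid 
from invalid UTF-8.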

The only problem I'm running into with this approach is that, to maintain 
backwards compatibility, {{Charset*}} must stay in the same 
{{org.apache.tika.parser.txt}} package, and tika-bundle throws a fit about that 
(coincidentally related to TIKA-966).  For testing purposes I turned off export 
of {{org.apache.tika.parser.txt}} in tika-bundle, but I'm sure that's not the 
solution we want.

What do you all think of this approach, and if it is reasonable, what's the 
best way to handle the {{org.apache.tika.parser.txt}} conflict in tika-bundle?
                
> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8-bit UTF-8 
> characters, the TextDetector and TextStatistics classes fail to detect it as 
> text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
