[jira] [Created] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

Tim Allison (Jira) Mon, 12 Feb 2024 08:25:05 -0800

Tim Allison created TIKA-4195:
---------------------------------

             Summary: JSoupParser conceals null from the EncodingDetector
                 Key: TIKA-4195
                 URL: https://issues.apache.org/jira/browse/TIKA-4195
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison



The JSoupParser runs encoding detection on the input stream. If the result is 
null, the parser applies the default charset -- US-ASCII. This behavior is OK. 

The problem is that there is no way to distinguish between a faulty encoding 
detector alleging 'US-ASCII' and the JSoupParser's default behavior. I 
don't think the JSoupParser should report the fallback encoding as if it were 
detected.

I'm not sure how best to report this in the metadata, but we need to be able to 
differentiate between a detected encoding and the fallback encoding.
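
One possible shape for this, as a rough sketch only: keep the detector's null 
visible by returning the charset together with a flag that says where it came 
from. Everything named below (EncodingResolution, Source, Result, resolve) is 
hypothetical and not existing Tika API; how the flag would be written into the 
metadata is left open.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names exist in Tika.
class EncodingResolution {

    // Records whether the charset was detected or is the parser's default.
    enum Source { DETECTED, FALLBACK }

    // Pairs the charset the parser will use with how it was chosen.
    static final class Result {
        final Charset charset;
        final Source source;

        Result(Charset charset, Source source) {
            this.charset = charset;
            this.source = source;
        }
    }

    // The default described in this issue.
    static final Charset DEFAULT_CHARSET = StandardCharsets.US_ASCII;

    // 'detected' stands in for whatever the encoding detector returned;
    // null means the detector could not decide.
    static Result resolve(Charset detected) {
        if (detected == null) {
            return new Result(DEFAULT_CHARSET, Source.FALLBACK);
        }
        return new Result(detected, Source.DETECTED);
    }
}
```

With something like this, a detector that really returns US-ASCII yields 
DETECTED, while a null result yields the same charset marked FALLBACK, so the 
two cases can be reported differently.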



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
