[jira] [Created] (TIKA-4491) The encoding format is ansi, GB18030 txt document, and the parsed content returns an empty String

yuying zhang (Jira) Fri, 19 Sep 2025 01:28:29 -0700

yuying zhang created TIKA-4491:
----------------------------------

             Summary: The encoding format is ansi, GB18030 txt document, and 
the parsed content returns an empty String
                 Key: TIKA-4491
                 URL: https://issues.apache.org/jira/browse/TIKA-4491
             Project: Tika
          Issue Type: Bug
          Components: detector, parser
    Affects Versions: 3.0.0
         Environment: Tika 3.0.0
            Reporter: yuying zhang



When I use AutoDetectParser to parse txt documents encoded as ANSI or GB18030, 
the parsed content is an empty string. Stepping into 
??AutoDetectParser.parse(inputStream, handler, metadata, context)??, I found 
that the detected type is application/octet-stream, whereas a txt document 
encoded in UTF-8 is detected as text/plain. As a workaround, I detected the 
file type with ??tika.detect(file)?? before calling parse and set the result 
as the Content-Type in the metadata, which solved the problem.
Why does this problem occur? Why does ??detector.detect(tis, metadata)?? 
return application/octet-stream, while ??tika.detect(file)?? returns 
text/plain?
{code:java}
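// Workaround: detect the type from the file first ...
String type = tika.detect(file);
// ... then pass it to the parser as a Content-Type hint.
metadata.set(Metadata.CONTENT_TYPE, type);
autoDetectParser.parse(inputStream, handler, metadata, context);
{code}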
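
For context, here is a minimal JDK-only sketch of how the affected files differ from the UTF-8 ones at the byte level. It does not reproduce Tika's detection logic and is not offered as the cause of the bug. The class and method names (Gb18030Demo, isValidUtf8, decodeGb18030) are hypothetical and exist only for this illustration. It shows that Chinese text encoded as GB18030 is not valid UTF-8, and that the text is recoverable once the charset is known.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
final class Gb18030Demo {
    static final Charset GB18030 = Charset.forName("GB18030");

    // Strictly decodes the bytes as UTF-8 and reports whether that succeeds.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes the bytes with an explicitly chosen GB18030 charset.
    static String decodeGb18030(byte[] bytes) {
        return new String(bytes, GB18030);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] gb = text.getBytes(GB18030);

        // The GB18030 bytes are rejected by a strict UTF-8 decoder.
        System.out.println("valid UTF-8: " + isValidUtf8(gb));
        // With the right charset the original text round-trips.
        System.out.println("round trip ok: " + decodeGb18030(gb).equals(text));
    }
}
```

Whether this byte-level difference is what makes ??detector.detect(tis, metadata)?? fall back to application/octet-stream is exactly the open question of this issue.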



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
