yuying zhang created TIKA-4491:
----------------------------------
Summary: The encoding format is ansi, GB18030 txt document, and
the parsed content returns an empty String
Key: TIKA-4491
URL: https://issues.apache.org/jira/browse/TIKA-4491
Project: Tika
Issue Type: Bug
Components: detector, parser
Affects Versions: 3.0.0
Environment: Tika 3.0.0
Reporter: yuying zhang
When I use AutoDetectParser to parse txt documents encoded as ANSI or GB18030,
the parsed content comes back as an empty string. When I stepped through the
AutoDetectParser call to ??parse(inputStream, handler, metadata, context)??, I
found that the detected type is application/octet-stream, which is inconsistent
with the text/plain detected for a txt document encoded as UTF-8. As a
workaround, I detected the file type with ??tika.detect(file)?? before calling
parse and set the result as the Content-Type in the metadata, and the problem
was solved.
Why does this problem occur? Why does ??detector.detect(tis, metadata)?? return
application/octet-stream, while ??tika.detect(file)?? returns text/plain?
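{code:java}
// Workaround: detect the type from the file first, then set it on the
// metadata before handing the stream to the parser.
String type = tika.detect(file);
metadata.set(Metadata.CONTENT_TYPE, type);
autoDetectParser.parse(inputStream, handler, metadata, context);
{code}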
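
The report contrasts UTF-8 files, which are detected as text/plain, with
ANSI/GB18030 files, which are not. As a minimal sketch of the byte-level
difference between the two encodings, the following standalone Java program
(JDK only, no Tika) checks whether the same Chinese text still decodes as
strict UTF-8 after being encoded as GB18030. The class name EncodingProbe and
the helper isValidUtf8 are hypothetical names chosen for illustration. Whether
Tika's detector relies on a check like this is an assumption and is not
confirmed anywhere in this report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; this is not Tika code.
class EncodingProbe {

    // Returns true only if the bytes decode as UTF-8 without any
    // malformed or unmappable sequences.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source stays ASCII.
        String sample = "\u4e2d\u6587";
        Charset gb18030 = Charset.forName("GB18030");

        byte[] utf8Bytes = sample.getBytes(StandardCharsets.UTF_8);
        byte[] gbBytes = sample.getBytes(gb18030);

        System.out.println("UTF-8 bytes decode as UTF-8:   " + isValidUtf8(utf8Bytes));
        System.out.println("GB18030 bytes decode as UTF-8: " + isValidUtf8(gbBytes));
        // Decoding with the matching charset recovers the original text.
        System.out.println("GB18030 round trip matches:    "
                + sample.equals(new String(gbBytes, gb18030)));
    }
}
```

The GB18030 bytes for this sample are not well-formed UTF-8, so any check that
depends on UTF-8 validity would treat the two files differently even though
both contain the same text.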