[jira] [Commented] (TIKA-4491) The encoding format is ansi, GB18030 txt document, and the parsed content returns an empty String

Tim Allison (Jira) Fri, 17 Oct 2025 18:54:13 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029507#comment-18029507
 ]


Tim Allison commented on TIKA-4491:
-----------------------------------

Text files are notoriously hard to detect.  Tika uses the file name as a hint 
if none of the magic-byte checks match.
{quote}Why does this problem occur?
Why does {{detector.detect(tis, metadata)}} return 
{{application/octet-stream}} while {{tika.detect(file)}} returns 
{{text/plain}}?
{quote}
When you call {{tika.detect(file)}}, Tika uses the file name as the final 
hint. When you pass a plain {{FileInputStream}}, Tika can't see the name of 
the file, so it doesn't get that hint.
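
The fallback described above can be sketched without Tika at all. This is a 
simplified illustration, not Tika's actual detection code: the class 
{{NameHintSketch}}, its methods, and the ASCII-only content check are all 
hypothetical stand-ins. It only shows why the same bytes can come back as 
{{application/octet-stream}} without a name and {{text/plain}} with one.

```java
import java.util.Locale;

// Hypothetical sketch: content check first, file name as the last hint.
// None of these names come from Tika; the content check is deliberately naive.
class NameHintSketch {

    static final String OCTET_STREAM = "application/octet-stream";
    static final String TEXT_PLAIN = "text/plain";

    // Assumed content check: accepts only printable ASCII plus tab/CR/LF.
    static String detectFromContent(byte[] prefix) {
        if (prefix.length == 0) {
            return OCTET_STREAM;
        }
        for (byte b : prefix) {
            int c = b & 0xFF;
            boolean printable = c >= 0x20 && c < 0x7F;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            if (!printable && !whitespace) {
                return OCTET_STREAM; // high-bit bytes look "binary" to this check
            }
        }
        return TEXT_PLAIN;
    }

    // Assumed name check: only knows the .txt extension.
    static String detectFromName(String resourceName) {
        if (resourceName != null
                && resourceName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return TEXT_PLAIN;
        }
        return OCTET_STREAM;
    }

    // The name is consulted only when the content check gives up.
    static String detect(byte[] prefix, String resourceName) {
        String type = detectFromContent(prefix);
        return OCTET_STREAM.equals(type) ? detectFromName(resourceName) : type;
    }

    public static void main(String[] args) {
        // Two Chinese characters encoded as GB18030 (bytes D6 D0 CE C4).
        byte[] gb18030 = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(detect(gb18030, null));            // no name available
        System.out.println(detect(gb18030, "test_ansi.txt")); // name supplied as hint
    }
}
```

In Tika itself, the name reaches the detector through the metadata, which is 
why the stream-only call and the file-based call can disagree.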

If you have a file, always use {{TikaInputStream.get(file, metadata)}}. If you 
only have an {{InputStream}}, add the file name (if there is one) to the 
metadata as a hint before parsing: 
{{metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "file_name.txt");}}

> The encoding format is ansi, GB18030 txt document, and the parsed content 
> returns an empty String
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4491
>                 URL: https://issues.apache.org/jira/browse/TIKA-4491
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 3.0.0
>         Environment: Tika 3.0.0
> jdk21
>            Reporter: yuying zhang
>            Priority: Major
>         Attachments: image-2025-10-12-21-21-04-527.png, test_ansi.txt
>
>
> *Problem Description:*
> When using *AutoDetectParser* to parse {{.txt}} files encoded in *ANSI* or 
> *GB18030*, the parsed content is empty.
> Debugging shows that during the call:
> autoDetectParser.parse(inputStream, handler, metadata, context);
> the detected content type is:
> application/octet-stream
> !image-2025-10-12-21-21-04-527.png|width=528,height=327!
> However, {{.txt}} files encoded in *UTF-8* are correctly detected as 
> {{text/plain}}.
> I tried detecting the file type with {{tika.detect(file)}} before calling 
> the parse function and setting the result as the Content-Type in the 
> metadata, and that solved the problem.
> {code:java}
> package org.example.documentparse;
> import org.apache.tika.Tika;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
> public class TikaEncodingTest {
>     public static void main(String[] args) throws Exception {
>         // 1. Prepare test file (ANSI or GB18030 encoded txt)
>         // (or "sample-gb18030.txt")
>         File file = new File("D:\\javaWebLearn\\testFile\\test_ansi.txt");
>         // 2. Create Tika and AutoDetectParser
>         Tika tika = new Tika();
>         AutoDetectParser parser = new AutoDetectParser();
>         Metadata metadata = new Metadata();
>         BodyContentHandler handler = new BodyContentHandler();
>         ParseContext context = new ParseContext();
>         // 3. Parse using InputStream directly (content may be empty)
>         try (InputStream inputStream = new FileInputStream(file)) {
>             parser.parse(inputStream, handler, metadata, context);
>             System.out.println("Content parsed directly: " + 
> handler.toString());
>             System.out.println("Detected type: " + 
> metadata.get(Metadata.CONTENT_TYPE));
>         }
>         System.out.println("-----------------------------------------");
>         // 4. Use Tika.detect(file) to manually set Content-Type
>         String type = tika.detect(file);
>         metadata.set(Metadata.CONTENT_TYPE, type);
>         handler = new BodyContentHandler(); // reset handler
>         try (InputStream inputStream = new FileInputStream(file)) {
>             parser.parse(inputStream, handler, metadata, context);
>             System.out.println("Content after setting Content-Type: " + 
> handler.toString());
>             System.out.println("Metadata Content-Type: " + 
> metadata.get(Metadata.CONTENT_TYPE));
>         }
>     }
> }
>  {code}
> h3. *Question:*
> Why does this problem occur?
> Why does {{detector.detect(tis, metadata)}} return 
> {{application/octet-stream}} while {{tika.detect(file)}} returns 
> {{text/plain}}?
> h3. *Expected Behavior:*
>  * AutoDetectParser should correctly parse {{.txt}} files encoded in 
> ANSI/GB18030 without requiring manual content type setting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
