[jira] [Commented] (TIKA-4491) The encoding format is ansi, GB18030 txt document, and the parsed content returns an empty String

Tilman Hausherr (Jira) Sat, 18 Oct 2025 09:19:26 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029313#comment-18029313 ]


Tilman Hausherr commented on TIKA-4491:
---------------------------------------

Modified code that fetches the attached test file itself, so no local copy is needed:
{code:java}
    public static void main(String[] args) throws Exception
    {
        // 1. Prepare test file (ANSI or GB18030 encoded txt)
        Path tempFile = Files.createTempFile("tika", ".txt");
        try (InputStream is = new URL("https://issues.apache.org/jira/secure/attachment/13078783/test_ansi.txt").openStream())
        {
            Files.copy(is, tempFile, StandardCopyOption.REPLACE_EXISTING);
        }

        // 2. Create Tika and AutoDetectParser
        Tika tika = new Tika();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        // 3. Parse using InputStream directly (content may be empty)
        try (InputStream inputStream = new FileInputStream(tempFile.toFile()))
        {
            parser.parse(inputStream, handler, metadata, context);
            System.out.println("Content parsed directly: " + handler.toString());
            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
        }

        System.out.println("-----------------------------------------");

        // 4. Use Tika.detect(file) to manually set Content-Type
        String type = tika.detect(tempFile.toFile());
        System.out.println("detected type: " + type);
        metadata.set(Metadata.CONTENT_TYPE, type);
        handler = new BodyContentHandler(); // reset handler
        try (InputStream inputStream = new FileInputStream(tempFile.toFile()))
        {
            parser.parse(inputStream, handler, metadata, context);
            System.out.println("Content after setting Content-Type: " + handler.toString());
            System.out.println("Metadata Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
        }
        
        Files.delete(tempFile);
    }
{code}
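
To reproduce without the attachment, a GB18030 file can also be generated locally. The following is a minimal sketch using only the JDK. The class name {{Gb18030Sample}}, the helper {{decodesStrictly}} and the sample text are assumptions for illustration and are not taken from test_ansi.txt. It only checks that such bytes are valid GB18030 but not valid UTF-8; it does not call Tika and makes no claim about how Tika's detector works.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Gb18030Sample {
    // Hypothetical sample text (six Chinese characters), written with
    // Unicode escapes so the source file itself stays pure ASCII.
    static final String SAMPLE = "\u4e2d\u6587\u6d4b\u8bd5\u6587\u672c";

    // Returns true if the bytes decode without any error in the given charset.
    static boolean decodesStrictly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Writes the sample text to a temporary .txt file encoded as GB18030.
    static Path writeSample() throws IOException {
        Path file = Files.createTempFile("tika-gb18030", ".txt");
        Files.write(file, SAMPLE.getBytes(Charset.forName("GB18030")));
        return file;
    }

    public static void main(String[] args) throws Exception {
        Path file = writeSample();
        byte[] bytes = Files.readAllBytes(file);
        System.out.println("valid GB18030: "
                + decodesStrictly(bytes, Charset.forName("GB18030")));
        System.out.println("valid UTF-8:   "
                + decodesStrictly(bytes, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

The path returned by {{writeSample()}} could be used in place of the downloaded {{tempFile}} in the code above. Whether a file generated this way triggers the same behaviour as the attachment is untested.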

> The encoding format is ansi, GB18030 txt document, and the parsed content 
> returns an empty String
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4491
>                 URL: https://issues.apache.org/jira/browse/TIKA-4491
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 3.0.0
>         Environment: Tika 3.0.0
> jdk21
>            Reporter: yuying zhang
>            Priority: Major
>         Attachments: image-2025-10-12-21-21-04-527.png, test_ansi.txt
>
>
> *Problem Description:*
> When using *AutoDetectParser* to parse {{.txt}} files encoded in *ANSI* or 
> *GB18030*, the parsed content is empty.
> Debugging shows that during the call:
> autoDetectParser.parse(inputStream, handler, metadata, context);
> the detected content type is:
> application/octet-stream
> !image-2025-10-12-21-21-04-527.png|width=528,height=327!
> However, {{.txt}} files encoded in *UTF-8* are correctly detected as 
> {{text/plain}}.
> As a workaround, I detected the file type with {{tika.detect(file)}} before 
> calling {{parse}} and set the result as the Content-Type in the metadata, 
> which solved the problem.
> {code:java}
> package org.example.documentparse;
> import org.apache.tika.Tika;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
> public class TikaEncodingTest {
>     public static void main(String[] args) throws Exception {
>         // 1. Prepare test file (ANSI or GB18030 encoded txt)
>         File file = new File("D:\\javaWebLearn\\testFile\\test_ansi.txt"); // or "sample-gb18030.txt"
>         // 2. Create Tika and AutoDetectParser
>         Tika tika = new Tika();
>         AutoDetectParser parser = new AutoDetectParser();
>         Metadata metadata = new Metadata();
>         BodyContentHandler handler = new BodyContentHandler();
>         ParseContext context = new ParseContext();
>         // 3. Parse using InputStream directly (content may be empty)
>         try (InputStream inputStream = new FileInputStream(file)) {
>             parser.parse(inputStream, handler, metadata, context);
>             System.out.println("Content parsed directly: " + handler.toString());
>             System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
>         }
>         System.out.println("-----------------------------------------");
>         // 4. Use Tika.detect(file) to manually set Content-Type
>         String type = tika.detect(file);
>         metadata.set(Metadata.CONTENT_TYPE, type);
>         handler = new BodyContentHandler(); // reset handler
>         try (InputStream inputStream = new FileInputStream(file)) {
>             parser.parse(inputStream, handler, metadata, context);
>             System.out.println("Content after setting Content-Type: " + handler.toString());
>             System.out.println("Metadata Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
>         }
>     }
> }
>  {code}
> h3. *Question:*
> Why does this problem occur?
> Why does {{detector.detect(tis, metadata)}} return 
> {{application/octet-stream}}, while {{tika.detect(file)}} returns 
> {{text/plain}}?
> h3. *Expected Behavior:*
>  * AutoDetectParser should correctly parse {{.txt}} files encoded in 
> ANSI/GB18030 without requiring manual content type setting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
