[
https://issues.apache.org/jira/browse/TIKA-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
knoobie updated TIKA-4431:
--------------------------
Summary: Mime Type Detection Error with File Name containing Number Sign
(was: Mime Type Detection Error with File Naming containing Number Sign )
> Mime Type Detection Error with File Name containing Number Sign
> ----------------------------------------------------------------
>
> Key: TIKA-4431
> URL: https://issues.apache.org/jira/browse/TIKA-4431
> Project: Tika
> Issue Type: Bug
> Components: core
> Reporter: knoobie
> Priority: Major
>
> I noticed that adding a number sign / hashtag (#) to the file name changes
> the detected MIME type.
> For example, "Lorem-Ipsum.csv" is correctly detected as "text/csv", but
> "Lorem-Ipsum#123.csv" (with the same file content) is detected as
> "text/plain".
>
> {code:java}
> import static org.assertj.core.api.Assertions.assertThat;
>
> import java.nio.charset.StandardCharsets;
>
> import org.apache.tika.Tika;
> import org.junit.jupiter.api.Test;
>
> public class ApacheTikaTest {
>
>     @Test
>     void detect_normalFileName() {
>         var tika = new Tika();
>         var fileName = "Lorem-Ipsum.csv";
>         var data = """
>                 Lorem;Ipsum;
>                 1 ;2 ;
>                 3 ;4 ;
>                 """;
>         assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>                 .isEqualTo("text/csv");
>     }
>
>     @Test
>     void detect_FileNameWithHashtag() {
>         var tika = new Tika();
>         var fileName = "Lorem-Ipsum#123.csv";
>         var data = """
>                 Lorem;Ipsum;
>                 1 ;2 ;
>                 3 ;4 ;
>                 """;
>         assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
>                 // Fails with result: 'text/plain'
>                 .isEqualTo("text/csv");
>     }
> }
> {code}
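
One possible explanation, not confirmed anywhere in this report, is that name-based detection treats the file name like a URL and discards everything from the '#' onward as a fragment, so the ".csv" extension is never seen. The sketch below only illustrates that hypothesis and a caller-side workaround. `FileNameSanitizer`, `sanitize` and `stripFragment` are hypothetical names invented for this example and are not part of Tika.

```java
public final class FileNameSanitizer {

    private FileNameSanitizer() {
    }

    // Hypothetical workaround: replace the characters that URL-style parsing
    // would treat as the start of a fragment ('#') or a query ('?').
    public static String sanitize(String fileName) {
        return fileName.replace('#', '_').replace('?', '_');
    }

    // Illustration of the suspected behaviour (an assumption, not Tika code):
    // drop everything from the first '#' onward.
    public static String stripFragment(String name) {
        int hash = name.indexOf('#');
        return hash == -1 ? name : name.substring(0, hash);
    }

    public static void main(String[] args) {
        String original = "Lorem-Ipsum#123.csv";
        // The extension is lost once the "fragment" is removed.
        System.out.println(stripFragment(original));           // Lorem-Ipsum
        // After sanitizing, the extension survives the same step.
        System.out.println(stripFragment(sanitize(original))); // Lorem-Ipsum_123.csv
    }
}
```

If the hypothesis holds, passing `FileNameSanitizer.sanitize(fileName)` instead of `fileName` to `tika.detect(...)` in the failing test should let the extension take part in detection again. The name is altered only for the detection call, so the stored file name can stay unchanged.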