knoobie created TIKA-4431:
-----------------------------
Summary: Mime Type Detection Error with File Naming containing
Number Sign
Key: TIKA-4431
URL: https://issues.apache.org/jira/browse/TIKA-4431
Project: Tika
Issue Type: Bug
Components: core
Environment: {code:xml}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>3.1.0</version>
</dependency>
{code}
Reporter: knoobie
I noticed that including a number sign / hashtag (#) in the file name
changes the result of MIME type detection.
For example, "Lorem-Ipsum.csv" is correctly detected as "text/csv", but
"Lorem-Ipsum#123.csv" (with the same file content) is detected as
"text/plain".
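One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the name is interpreted like a URL, everything from "#" onward counts as a fragment, so the ".csv" extension would no longer be part of the name used for detection. The sketch below uses only java.net.URI from the JDK to show that split. The class name FragmentSplitDemo is hypothetical, and nothing here calls Tika or claims to reproduce its internals.

```java
import java.net.URI;

/** Hypothetical demo class: shows how a URI parser splits a name containing '#'. */
final class FragmentSplitDemo {

    private FragmentSplitDemo() {}

    /** Returns the path part of the name when it is parsed as a relative URI. */
    static String pathOf(String fileName) {
        return URI.create(fileName).getPath();
    }

    /** Returns the fragment part (the text after '#'), or null if there is none. */
    static String fragmentOf(String fileName) {
        return URI.create(fileName).getFragment();
    }

    public static void main(String[] args) {
        String name = "Lorem-Ipsum#123.csv";
        // The extension ends up in the fragment, not in the path.
        System.out.println("path:     " + pathOf(name));
        System.out.println("fragment: " + fragmentOf(name));
    }
}
```

If the detector treats names this way, only "Lorem-Ipsum" would be left for extension matching, which would be consistent with the fallback to "text/plain".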
{code:java}
import static org.assertj.core.api.Assertions.assertThat;

import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.junit.jupiter.api.Test;

public class ApacheTikaTest {

    @Test
    void detect_normalFileName() {
        var tika = new Tika();
        var fileName = "Lorem-Ipsum.csv";
        var data = """
                Lorem;Ipsum;
                1 ;2 ;
                3 ;4 ;
                """;
        assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
                .isEqualTo("text/csv");
    }

    @Test
    void detect_FileNameWithHashtag() {
        var tika = new Tika();
        var fileName = "Lorem-Ipsum#123.csv";
        var data = """
                Lorem;Ipsum;
                1 ;2 ;
                3 ;4 ;
                """;
        assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
                // Fails with result: 'text/plain'
                .isEqualTo("text/csv");
    }
}
{code}
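As a possible stopgap on the caller side, the name passed to detection could be stripped of "#" first, since the first test above suggests the ".csv" extension alone is enough of a hint. This is a minimal sketch, not a fix in Tika itself; DetectionNames and withoutHash are hypothetical names, and the helper uses only the JDK.

```java
/** Hypothetical helper: prepares a file name for use as a detection hint. */
final class DetectionNames {

    private DetectionNames() {}

    /**
     * Returns a copy of fileName with every '#' replaced by '_'.
     * All other characters, including the extension, are left unchanged.
     */
    static String withoutHash(String fileName) {
        return fileName.replace('#', '_');
    }
}
```

The failing test would then call tika.detect(data.getBytes(StandardCharsets.UTF_8), DetectionNames.withoutHash(fileName)). Whether that returns "text/csv" has not been verified here; the original file name should still be kept for everything other than detection.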