Hi,
I ran into the same XML content-type detection issue (text/plain returned for 
.xml files), so the information that the XML declaration is required for the 
MIME type to be set to application/xml was useful.

Most of the incoming XML files to our system do not contain the declaration, 
so they produce false positives in our current Tika file validation logic.

Our validation takes the incoming file and sends it to /meta for analysis. The 
content-type returned is then sent to the /mime-types endpoint, and the 
defaultExtension from that response is compared with the incoming file's 
extension.

Since text/plain is returned as the content-type, we get a false positive, as 
there is no indication in /mime-types/text/plain that such a file can also be 
.xml.

Would it be possible to add .xml to the extensions node for text/plain?

If not, is there another way to have a connection from text/plain to 
application/xml?
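
To illustrate what I mean, here is a minimal sketch of the comparison and of 
the kind of local fallback we could apply on our side. It is not our actual 
code: the class, method and map names are made up, and the extension lists are 
hard-coded assumptions standing in for what /mime-types returns.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ExtensionCheck {
    // Stand-in for the extension lists returned by /mime-types/<type>.
    // These values are illustrative assumptions, not copied from Tika.
    static final Map<String, List<String>> EXTENSIONS = Map.of(
            "text/plain", List.of(".txt", ".text"),
            "application/xml", List.of(".xml", ".xsl", ".xsd"));

    // Hypothetical local table: other types to try when the detected type
    // does not list the file's extension.
    static final Map<String, List<String>> FALLBACK_TYPES = Map.of(
            "text/plain", List.of("application/xml"));

    // Returns the lower-cased extension including the dot, or "" if none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot).toLowerCase(Locale.ROOT);
    }

    // Current behaviour: the extension must be listed for the detected type.
    static boolean strictMatch(String detectedType, String fileName) {
        return EXTENSIONS.getOrDefault(detectedType, List.of())
                .contains(extensionOf(fileName));
    }

    // Strict check first, then the same check against each fallback type.
    static boolean lenientMatch(String detectedType, String fileName) {
        if (strictMatch(detectedType, fileName)) {
            return true;
        }
        for (String alt : FALLBACK_TYPES.getOrDefault(detectedType, List.of())) {
            if (strictMatch(alt, fileName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(strictMatch("text/plain", "order.xml"));  // false
        System.out.println(lenientMatch("text/plain", "order.xml")); // true
        System.out.println(lenientMatch("text/plain", "order.pdf")); // false
    }
}
```

The first call reproduces the false positive we see today. The fallback table 
is the "connection" from text/plain to application/xml that we would otherwise 
have to maintain ourselves.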

Thanks,

Regards,
Willy

On Thu, 13 Jul 2023 at 22:22, Tim Allison wrote:
> I wasn't around on the project when the xml mime magic was developed.  So, 
> take this as personal opinion, not an official statement. :D
> 
> The first item is intentional (xml data with no declaration).  Text-based 
> files are challenging, and looking for matching tags is beyond what our 
> current detection does...not to say that it would be impossible.  We do allow 
> a missing declaration for specific subtypes, such as svg, IIRC.
> 
> The second item is surprising because it looks like we should only require 
> '<?xml' at offset 0. I'll look into that tomorrow.
> 
> On Wed, Jul 12, 2023 at 11:58 AM John Ulrik <uja...@gmail.com> wrote:
>> Hi everyone, 
>> 
>> Tika (testing with v2.8.0, but doesn't seem to be version-specific) seems to 
>> detect generic XML depending on the existence of and details on the XML 
>> declaration:
>> 
>> @Test
>> public void testDetect() throws IOException {
>>     // No XML declaration: detected as text/plain
>>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>>             "<data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>>         assertEquals(MediaType.TEXT_PLAIN,
>>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>>     }
>>     // Bare "<?xml?>" declaration: still detected as text/plain
>>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>>             "<?xml?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>>         assertEquals(MediaType.TEXT_PLAIN,
>>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>>     }
>>     // Declaration with version='1.0': detected as application/xml
>>     try (final InputStream in = new BufferedInputStream(new ByteArrayInputStream(
>>             "<?xml version='1.0'?><data>42</data>".getBytes(StandardCharsets.US_ASCII)))) {
>>         assertEquals(MediaType.APPLICATION_XML,
>>                 new Tika().getDetector().detect(in, new Metadata()).getBaseType());
>>     }
>> }
>> 
>> In short, only XML files with an XML declaration that explicitly includes an 
>> encoding will be detected as application/xml. XML files without an XML 
>> declaration or with an XML declaration but without encoding will be detected 
>> as text/plain.
>> 
>> Is that intentional?
>> 
>> Thanks
>> John
>> 
