Shuai Liu created TIKA-1517:
-------------------------------

             Summary: MIME type selection with probability
                 Key: TIKA-1517
                 URL: https://issues.apache.org/jira/browse/TIKA-1517
             Project: Tika
          Issue Type: Improvement
          Components: mime
    Affects Versions: 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.10, 0.9, 0.8, 0.7, 
0.6, 0.5, 0.4, 0.3, 0.2, 0.1-incubating
            Reporter: Shuai Liu


Problem and intuition
The current MIME type detection in Tika is rather inflexible and relies 
heavily on the outcome of magic-bytes detection: whenever magic bytes match 
for a file, Tika follows the file type they indicate.

The proposed approach applies Bayes' theorem in a lightweight way: users 
assign a weight, expressed as a probability, to each of the MIME type 
identification methods available in Tika, and thereby control how much each 
one is trusted. There are currently 3 such methods: magic bytes, file 
extension, and the metadata content-type hint. With these weights users 
choose which method they trust most, although the magic-bytes method is 
usually the most trustworthy. The benefit shows in situations where file type 
identification is sensitive: some users might want every identification 
method to arrive at the same file type before they start processing a file, 
because incorrect identification is not tolerable for them. The current 
implementation does not offer this flexibility and relies heavily on magic 
bytes (although magic bytes are the most reliable of the 3 methods currently 
available in Tika).
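
To make the idea concrete, here is a minimal sketch of how weighted votes 
from the 3 methods could be combined. It is not Tika code and does not use 
any Tika API: the class WeightedMimeVote, its Vote type, the trust values and 
the threshold are all hypothetical names and numbers chosen for illustration. 
It also assumes the methods err independently (a naive Bayes simplification) 
and puts a uniform prior on the candidate types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Tika.
final class WeightedMimeVote {

    // Type returned when no candidate is probable enough.
    static final String FALLBACK = "application/octet-stream";

    /** One method's answer plus the user-assigned trust in that method. */
    static final class Vote {
        final String mimeType; // null means the method has no opinion
        final double trust;    // assumed probability that the method is right

        Vote(String mimeType, double trust) {
            this.mimeType = mimeType;
            this.trust = trust;
        }
    }

    /**
     * Normalised score per candidate type. A candidate's raw score is the
     * product over all voting methods of trust (if the method named this
     * candidate) or 1 - trust (if it named something else).
     */
    static Map<String, Double> posterior(Vote... votes) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Vote v : votes) {
            if (v.mimeType != null) {
                scores.put(v.mimeType, 1.0);
            }
        }
        double total = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double s = 1.0;
            for (Vote v : votes) {
                if (v.mimeType == null) {
                    continue; // methods without an opinion are skipped
                }
                s *= v.mimeType.equals(e.getKey()) ? v.trust : 1.0 - v.trust;
            }
            e.setValue(s);
            total += s;
        }
        if (total > 0.0) {
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                e.setValue(e.getValue() / total);
            }
        }
        return scores;
    }

    /** Most probable type, or FALLBACK if it does not reach the threshold. */
    static String select(double threshold, Vote... votes) {
        String best = FALLBACK;
        double bestP = 0.0;
        for (Map.Entry<String, Double> e : posterior(votes).entrySet()) {
            if (e.getValue() > bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return bestP >= threshold ? best : FALLBACK;
    }
}
```

With magic bytes voting application/pdf at trust 0.9 and both the extension 
and the metadata hint voting text/plain at trust 0.6, the raw scores are 
0.9 * 0.4 * 0.4 = 0.144 and 0.1 * 0.6 * 0.6 = 0.036, so application/pdf gets 
a probability of about 0.8. A user who requires 0.9 before processing would 
get the fallback type instead, which is the stricter behaviour described 
above.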



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
