[ https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570250#comment-14570250 ]
Chris A. Mattmann commented on TIKA-1517:
-----------------------------------------

Quick advantage I can see: their code is GPLv2, ours is ALv2 :) But an apples-to-apples comparison would be good.

> MIME type selection with probability
> ------------------------------------
>
>                 Key: TIKA-1517
>                 URL: https://issues.apache.org/jira/browse/TIKA-1517
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
>                      0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
>            Reporter: Luke sh
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>              Labels: memex
>             Fix For: 1.9
>
>         Attachments: BaysianTest.java
>
>
> Improvement and intuition
> The original implementation of MIME type selection/detection is somewhat
> inflexible by design, because it relies heavily on the outcome of
> magic-bytes MIME type identification: whenever magic-bytes detection
> applies to a file, Tika follows the file type it reports. It may be better
> to give users more control over which method is preferred.
> The proposed approach loosely incorporates Bayes' theorem: users assign a
> weight, expressed as a probability, to each of the MIME type identification
> methods available in Tika. There are currently 3 such methods (magic bytes,
> file extension, and the metadata content-type hint). By weighting the
> methods, users can choose which one they trust most, although the
> magic-bytes method is usually trustworthy. The benefit shows in situations
> where file type identification is sensitive: some users might want all of
> the identification methods to agree on the same file type before they start
> processing a file, because incorrect file type identification is less
> tolerable for them.
> The current implementation seems less flexible for this purpose and relies
> heavily on the magic-bytes identification method (although magic bytes is
> the most reliable of the 3 methods).
> Proposed design:
> The idea is to attach a probability, as a weight, to each MIME type
> identification method currently implemented in Tika (the magic-bytes
> approach, file extension match, and metadata content-type hint).
> For example, as a user, I would probably like to assign a preference to
> each method based on how much I trust it, and to order the results when
> they don't coincide.
> Bayes' rule seems appropriate for capturing this intuition.
> The following are needed to implement Bayes' rule.
> > Prior probability P(file_type), e.g. P(pdf). In theory this is computed
> > from samples, and it depends on the domain or use case. Intuitively we
> > care more about the order of the weights or probabilities of the results
> > than about the actual numbers, and the prior itself depends on the
> > samples for a particular use case or domain. E.g. if we happen to crawl a
> > website that contains mostly pdf files, we can collect some samples and
> > compute the prior from them; if the samples show that 90% of the docs are
> > pdf, the prior is P(pdf) = 0.9. Here, however, we propose to make the
> > prior a configurable param for users, and by default we leave the prior
> > "unapplicable".
> > Alternatively, we could define the prior for each file type to be
> > 1/[number of supported file types in Tika]. I think that would be
> > approximately 1/1157, and using this number seems fairer. The reason for
> > avoiding it is that this prior is the same for every type. In the end we
> > care more about the order of the results, and a fixed number leaves the
> > order fixed, so bringing 1/1157 into the Bayesian equation would not
> > affect the order and would only burden our implementation with extra
> > computation. Thus we leave it as "unapplicable", which means we assign 1
> > to it, as if it did not exist. Note again that we care more about the
> > order than about the actual number; the param is configurable, and we
> > believe this provides flexibility in some use cases.
> > Conditional probability of a positive test given a file type,
> > P(test | file_type), e.g. P(test1 = pdf | pdf). This probability also
> > depends on collected samples and on the domain or use case, so we leave
> > it configurable. Based on our intuition, test1 (i.e. the magic-bytes
> > method) is the most trustworthy, so the default value is 0.75 for
> > P(test1 = a_file_type | a_file_type); that is, given a file whose type is
> > "a_file_type", the probability that test1 predicts "a_file_type" is
> > 0.75. Following the same intuition, we propose 0.7 for test3 and 0.65 for
> > test2.
> (Note again: test1 = magic bytes, test2 = file extension, test3 = metadata
> content-type hint.)
> > Conditional probability of negative tests also needs to be defined
> > intuitively.
> E.g. by default, given a file whose type is not pdf, the probability that
> test1 predicts pdf is 1 - P(test1 = pdf | pdf), thus
> P(test1 = pdf | ~pdf) = 1 - 0.75 = 0.25, as we trust test1 the most. With
> the same intuition, the values for test2 and test3 are 0.35 and 0.3
> respectively.
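The claim that a uniform prior such as 1/1157 does not change the order can be checked with a short script. This is only a sketch in Python, not code from the attachment: the names `TRUE_POSITIVE`, `false_positive`, `posterior`, and `ranking` are made up for illustration, and it assumes a single test is scored on its own with Bayes' rule. Only the numeric defaults (0.75, 0.65, 0.7, and their complements) come from the description above.

```python
# Default trust weights P(test = t | t) from the description.
# The dictionary keys and function names are hypothetical.
TRUE_POSITIVE = {"magic": 0.75, "extension": 0.65, "metadata": 0.70}


def false_positive(method):
    # Default from the description: P(test = t | ~t) = 1 - P(test = t | t)
    return 1.0 - TRUE_POSITIVE[method]


def posterior(method, prior=None):
    """P(t | this one test predicts t).

    prior=None models the "unapplicable" default, P(t) = P(~t) = 1.
    Otherwise P(t) = prior and P(~t) = 1 - prior.
    """
    p_t, p_not = (1.0, 1.0) if prior is None else (prior, 1.0 - prior)
    hit = p_t * TRUE_POSITIVE[method]
    miss = p_not * false_positive(method)
    return hit / (hit + miss)


def ranking(prior=None):
    # Methods ordered from most to least trusted under the given prior.
    return sorted(TRUE_POSITIVE, key=lambda m: posterior(m, prior), reverse=True)


print(ranking())            # ['magic', 'metadata', 'extension']
print(ranking(1.0 / 1157))  # ['magic', 'metadata', 'extension']
```

The absolute scores shrink a lot under the 1/1157 prior, but the order of the methods stays the same, which is the argument for leaving the prior "unapplicable" by default.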
>
> >> The goal is to find
> P(file_type | test1 = file_type, test2 = file_type, test3 = file_type)
> (Please note: because we are mostly interested in the order of the choices
> rather than in the exact value, we selectively drop some of the parameters
> used in Bayes' rule. Those that are not considered are set to 1 by
> default.)
> For example, given a file for which the 3 tests predict as follows
> test1 = pdf
> test2 = pdf
> test3 = pdf
> prior: P(pdf) = 1 and P(~pdf) = 1 (meaning they are not applicable)
> P(test1=pdf|pdf) = 0.75
> P(test2=pdf|pdf) = 0.65
> P(test3=pdf|pdf) = 0.7
> With the same concept or intuition, we have the negative conditional
> probabilities by default
> P(test1=pdf|~pdf) = 0.25
> P(test2=pdf|~pdf) = 0.35
> P(test3=pdf|~pdf) = 0.3
> Then we are ready to compute.
> Our goal is P(pdf|test1=pdf, test2=pdf, test3=pdf)
> P(pdf|test1=pdf, test2=pdf, test3=pdf) = [P(pdf) * P(test1=pdf|pdf) *
> P(test2=pdf|pdf) * P(test3=pdf|pdf)] / total probability
> where
> total probability = P(pdf) * P(test1=pdf|pdf) * P(test2=pdf|pdf) *
> P(test3=pdf|pdf) + P(~pdf) * P(test1=pdf|~pdf) * P(test2=pdf|~pdf) *
> P(test3=pdf|~pdf) = 0.34125 + 0.02625 = 0.3675
> Thus,
> P(pdf|test1=pdf, test2=pdf, test3=pdf) = 0.34125 / 0.3675 = 0.92857
> ---------------------------------------------------------------------------------
> example 2
> test1=pdf
> test2=txt
> test3=txt
> In this example, test2 and test3 do not agree with test1.
> So we have 2 types to compare; let's compute the probability of each file
> type conditioned on those test results.
> For simplicity, I will write test1=pdf as test1+, and a test that predicts
> a different type as test-, with P(test-|type) = 1 - P(test+|type).
> >pdf
> P(+|test1+, test2-, test3-)
> = [P(+) * P(test1+|pdf) * P(test2-|pdf) * P(test3-|pdf)] / total probability
> = (0.75 * 0.35 * 0.3) / (0.75 * 0.35 * 0.3 + 0.25 * 0.65 * 0.7)
> = 0.07875 / 0.1925
> = 0.40909
> >txt
> P(+|test1-, test2+, test3+) = 0.590909
> ---------------------------------------------------------------------------------
> example 3
> test1=pdf
> test2=txt
> test3=uc
> >pdf
> P(+|test1+, test2-, test3-) = 0.40909
> >txt
> P(+|test1-, test2+, test3-) = 0.20968
> >uc
> P(+|test1-, test2-, test3+) = 0.29518
> ---------------------------------------------------------------------------------
> Since we are more interested in the order of the weights, in that we prefer
> to put more weight on the methods with higher preference, we can simplify
> the computation further by ignoring the probabilities of the tests that
> have a negative prediction.
> Consider example 3 above,
> >pdf
> P(+|test1+, test2-, test3-) = 0.40909
> we can ignore the probabilities of test2- and test3-, as we are more
> interested in the order of preference;
> the equation can be rewritten as follows
> P(+|test1+) = 0.75/(0.75+0.25) = 0.75, where the total probability becomes 1;
> note the prior is also set to 1 by default for simplicity.
> Similarly, we have
> >txt
> P(+|test2+) = 0.65
> >uc
> P(+|test3+) = 0.7
> This follows the initial intuitive assumption, and the intuitive order is
> also preserved.
> Some of the parameters are left out by default for computational
> simplicity, but the goal is to provide a way through which users can
> control which method they want to use via probability weights; this also
> leaves room for more MIME detection algorithms.
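The worked examples can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the attached BaysianTest.java (whose contents are not shown in this message). The names `TRUST`, `posterior`, and `rank` are hypothetical. It assumes the three tests are conditionally independent, that a test predicting a different type counts as a negative result with P(test-|type) = 1 - P(test+|type), and that both priors default to 1 as described.

```python
# Default trust weights P(test = t | t) from the issue description.
# All identifiers here are illustrative, not part of Tika.
TRUST = {"magic": 0.75, "extension": 0.65, "metadata": 0.70}


def posterior(candidate, predictions, prior=1.0, prior_not=1.0):
    """P(candidate | test results) for conditionally independent tests.

    predictions maps a method name to the MIME type that method predicted.
    prior and prior_not default to 1, the "unapplicable" setting.
    """
    like, like_not = prior, prior_not
    for method, predicted in predictions.items():
        tp = TRUST[method]  # P(test = t | t)
        fp = 1.0 - tp       # P(test = t | ~t), the default complement
        if predicted == candidate:
            like *= tp
            like_not *= fp
        else:
            like *= 1.0 - tp      # P(test != t | t)
            like_not *= 1.0 - fp  # P(test != t | ~t)
    return like / (like + like_not)


def rank(predictions):
    # Score every type that at least one test predicted, best first.
    scores = {c: posterior(c, predictions) for c in set(predictions.values())}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


example3 = {"magic": "pdf", "extension": "txt", "metadata": "uc"}
for mime, score in rank(example3):
    print(f"{mime}: {score:.5f}")
# pdf: 0.40909
# uc: 0.29518
# txt: 0.20968
```

Passing only the agreeing test, e.g. `posterior("pdf", {"magic": "pdf"})`, gives 0.75, which corresponds to the simplified computation that ignores the negative predictions.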