[ 
https://issues.apache.org/jira/browse/TIKA-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512284#comment-14512284
 ] 

Luke sh commented on TIKA-1517:
-------------------------------

Notes:
A pull request adds this support to the Tika() facade.
To use this feature, a user would need code like the following.

tika = new Tika(new TikaConfig() {
    @Override
    protected Detector getDefaultDetector(MimeTypes types, ServiceLoader loader) {
        /*
         * Here is an example that uses the builder to
         * instantiate the object.
         */
        Builder builder = new ProbabilisticMimeDetectionSelector.Builder();
        ProbabilisticMimeDetectionSelector proDetector =
                new ProbabilisticMimeDetectionSelector(
                        types, builder.priorMagicFileType(0.5f)
                                .priorExtensionFileType(0.5f)
                                .priorMetaFileType(0.5f));
        return new DefaultProbDetector(proDetector, loader);
    }
});
The idea is simple: we override getDefaultDetector() to return a 
DefaultProbDetector, which extends CompositeDetector. A CompositeDetector 
(whose supertype is Detector) takes a list of detectors, and when its 
detect() method is called, each detector in the list is called in turn. 
The original implementation of getDefaultDetector() in TikaConfig returns 
an instance of “DefaultDetector”, which also extends CompositeDetector and 
supplies a list of detectors that includes “MimeTypes”, the native 
implementation with 3 detection methods (i.e. magic bytes, extension and 
metadata hint). DefaultProbDetector replaces this MimeTypes with 
ProbabilisticMimeDetectionSelector.
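
To make the "called in turn" behaviour concrete, here is a minimal, 
self-contained sketch. It does not use the Tika classes: SimpleDetector, 
Composite and sample() are hypothetical stand-ins invented for 
illustration, and the rule "the first answer that is not 
application/octet-stream wins" is an assumption of this sketch. The real 
CompositeDetector may apply further rules when the answers differ.

```java
import java.util.ArrayList;
import java.util.List;

class CompositeSketch {
    /** Hypothetical stand-in for a detector: maps a file name to a MIME type. */
    interface SimpleDetector {
        String detect(String fileName);
    }

    static final String UNKNOWN = "application/octet-stream";

    /** Calls every wrapped detector in list order, like a composite would. */
    static class Composite implements SimpleDetector {
        private final List<SimpleDetector> detectors;
        /** Answers in call order, kept only to show the sequential calls. */
        final List<String> callLog = new ArrayList<>();

        Composite(List<SimpleDetector> detectors) {
            this.detectors = detectors;
        }

        @Override
        public String detect(String fileName) {
            callLog.clear();
            String result = UNKNOWN;
            for (SimpleDetector detector : detectors) {
                String answer = detector.detect(fileName);
                callLog.add(answer);
                // Assumption of this sketch: keep the first non-unknown answer.
                if (UNKNOWN.equals(result)) {
                    result = answer;
                }
            }
            return result;
        }
    }

    /** Builds a composite from three toy detectors. */
    static Composite sample() {
        List<SimpleDetector> detectors = new ArrayList<>();
        detectors.add(name -> UNKNOWN);                      // never recognises anything
        detectors.add(name -> name.endsWith(".pdf") ? "application/pdf" : UNKNOWN);
        detectors.add(name -> "text/plain");                 // always answers text/plain
        return new Composite(detectors);
    }

    public static void main(String[] args) {
        Composite composite = sample();
        System.out.println(composite.detect("report.pdf"));
        System.out.println(composite.callLog);
    }
}
```

Swapping one element of the list for another detector, as 
DefaultProbDetector does with MimeTypes, changes the answers without 
changing the loop.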

To set the preferential weights, create an instance of 
ProbabilisticMimeDetectionSelector as in the example snippet above.

Alternatively, if we are happy with the default settings of 
ProbabilisticMimeDetectionSelector, it is fine to omit the arguments and 
call only “return new DefaultProbDetector();”.
Or, if we don't want to write any extra code, the following can also 
be used.
/*
 * An XML file needs to be given to tell where the detector is located.
 * Customers can build their own detectors by including or excluding this
 * feature or any detectors in the composite list at will.
 */
Tika tika = new Tika(new TikaConfig(new File("TIKA-detector-sample.xml")));


TIKA-detector-sample.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultProbDetector"/>
  </detectors>
</properties>




> MIME type selection with probability
> ------------------------------------
>
>                 Key: TIKA-1517
>                 URL: https://issues.apache.org/jira/browse/TIKA-1517
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 0.1-incubating, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 
> 0.10, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
>            Reporter: Luke sh
>            Priority: Trivial
>         Attachments: BaysianTest.java
>
>
> Improvement and intuition
> The original implementation of MIME type selection/detection is somewhat 
> inflexible by design, as it relies heavily on the outcome of magic-bytes 
> MIME type identification; e.g. if magic bytes are applicable to a file, 
> Tika will follow the file type detected by magic bytes. It may be better 
> to provide more control over the method of choice.
> The proposed approach loosely incorporates Bayes' theorem: users assign 
> a weight, expressed as a probability, to each of the MIME type 
> identification methods available in Tika, and so control which method is 
> preferred. Currently there are 3 methods for identifying MIME type in 
> Tika (i.e. magic bytes, file extension and metadata content-type hint). 
> With these weights, users can choose which method they trust most, 
> although the magic-bytes method is usually trustworthy. The benefit is 
> that in some situations file type identification must be strict: some 
> users might want all of the MIME type identification methods to agree on 
> the same file type before they start processing a file, because incorrect 
> file type identification is not tolerable for them. The current 
> implementation is less flexible for this purpose and relies heavily on 
> the magic-bytes identification method (although magic bytes are the most 
> reliable of the three).
> Proposed design:
> The idea of selection is to attach a probability, as a weight, to each MIME 
> type identification method currently implemented in Tika (the magic-bytes 
> approach, file extension match and metadata content-type hint).
> For example, as a user, I would probably like to assign a preference to 
> each method based on how much I trust it, and to rank the results if they 
> don't coincide.
> Bayes' rule seems appropriate for capturing this intuition.
> The following is needed to apply Bayes' rule.
> > Prior probability P(file_type), e.g. P(pdf). In theory this is computed 
> > from samples and depends on the domain or use case. For instance, if we 
> > crawl a website that contains mostly PDF files, we can collect some 
> > samples and find that 90% of the docs are PDF, so the prior is 
> > P(pdf) = 0.9. Here we propose to make the prior a configurable parameter 
> > and to leave it "unapplicable" by default. Alternatively, we could 
> > define the prior for each file type as 1/[number of supported file types 
> > in Tika], which I think would be approximately 1/1157. That seems fairer, 
> > but because the same value is used for every type it cannot change the 
> > order of the results, which is what we care about, and it would only 
> > burden the implementation with extra computation. We therefore leave 
> > the prior "unapplicable", which means we assign it 1, as if it did not 
> > exist. Since the parameter is configurable, we believe it still provides 
> > a lot of flexibility in some use cases.
> > Conditional probability of a positive test given a file type, P(test | 
> > file_type), e.g. P(test1 = pdf | pdf). This probability also depends on 
> > collected samples and the domain or use case, so we leave it 
> > configurable. Based on our intuition that test1 (i.e. the magic-bytes 
> > method) is the most trustworthy, the default value of P(test1 = 
> > a_file_type | a_file_type) is 0.75; that is, given a file whose type is 
> > "a_file_type", the probability that test1 predicts "a_file_type" is 0.75. 
> > Next we propose 0.7 for test3 and 0.65 for test2.
> (note again, test1 = magic-bytes, test2 = file extension, test3 = Metadata 
> Content-type hint)
> > The conditional probability of a negative test also needs to be defined 
> > intuitively.
> E.g. by default, given a file that is not pdf, the probability that test1 
> predicts pdf is 1 - P(test1 = pdf | pdf), thus P(test1=pdf | ~pdf) = 1 - 
> 0.75 = 0.25, as we trust test1 the most. With the same intuition, the 
> defaults for test2 and test3 are 0.35 and 0.3 respectively.
>  
> >> The goal is to find 
> P(file_type | test1 = file_type, test2=file_type, test3=file_type)
> (Please note: since we are mostly interested in the order of the choices 
> rather than the exact values, we selectively drop some of the parameters 
> used in Bayes' rule. Those that are not considered are set to 1 by default.)
> For example, given a file the 3 tests have predicted as follows
> test1 = pdf
> test2 = pdf
> test3 = pdf
> prior: P(pdf) = 1 and P(~pdf) = 1 (meaning they are not applicable )
> P(test1=pdf|pdf) = 0.75
> P(test2=pdf|pdf) =0.65
> P(test3=pdf|pdf) = 0.7
> With the same concept or intuition, we have the negative conditional 
> probability by default
> P(test1=pdf|~pdf) = 0.25
> P(test2=pdf|~pdf) =0.35
> P(test3=pdf|~pdf) = 0.3
> Then we are ready to compute.
> Our goal is P(pdf|test1=pdf, test2=pdf, test3=pdf)
> P(pdf|test1=pdf, test2=pdf, test3=pdf) = [P(pdf) * P(test1=pdf|pdf) * 
> P(test2=pdf|pdf) * P(test3=pdf|pdf)]/total probability 
> where 
> total probability = P(pdf) * P(test1=pdf|pdf) * P(test2=pdf|pdf) * 
> P(test3=pdf|pdf) + P(~pdf) * P(test1=pdf|~pdf) * P(test2=pdf|~pdf) * 
> P(test3=pdf|~pdf) = 0.34125 + 0.02625 = 0.3675
> Thus, 
> P(pdf|test1=pdf, test2=pdf, test3=pdf) = 0.92857
> ---------------------------------------------------------------------------------
> example 2
> test1=pdf
> test2=txt
> test3=txt
> In this example, test2 and test3 do not agree with test1.
> So we have 2 types to compare; let's compute the probability of each file 
> type conditioned on the test results.
> For simplicity, when a test predicts the type under consideration I will 
> write e.g. test1+, and test1- otherwise.
> > pdf
> P(+|test1+, test2-, test3-) 
> = [P(+)*P(test1+|pdf)*P(test2-|pdf)*P(test3-|pdf)]/total probability
> = 0.07875/0.1925 = 0.40909
> > txt
> P(+|test1-, test2+, test3+)  = 0.11375/0.1925 = 0.590909
> ---------------------------------------------------------------------------------
> example 3
> test1=pdf
> test2=txt
> test3=uc
> > pdf 
> P(+|test1+, test2-, test3-)  = 0.40909
> > txt
> P(+|test1-, test2+, test3-) = 0.20968
> > uc
> P(+|test1-, test2-, test3+)  = 0.29518
> ---------------------------------------------------------------------------------
> Since we are mainly interested in the order of the weights, i.e. we want 
> to put more weight on the methods we prefer, we can further simplify the 
> computation by ignoring the probabilities of the tests that gave a negative 
> prediction.
> Consider example 3 above: 
> > pdf 
> P(+|test1+, test2-, test3-)  = 0.40909
> We can ignore the probabilities of test2- and test3-, as we are more 
> interested in the order of preference;
> the equation can be rewritten as follows
> P(+|test1+) = 0.75/(0.75+0.25) = 0.75 where the total probability becomes 1; 
> note the prior is set to 1 by default for simplicity too.
> Similarly, we have 
> > txt
> P(+|test2+) = 0.65
> > uc
> P(+|test3+) = 0.7
> This follows the initial intuitive assumption, and the intuitive order is 
> also preserved. 
> Some of the parameters are left out by default to keep the computation 
> simple, but the goal is to give users a way to control, through probability 
> weights, which method they want to use; this also leaves room for adding 
> more MIME detection algorithms. 
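
The worked examples in the quoted description can be reproduced with a 
short sketch. This is not the code from the pull request: the class 
BayesMimeSketch and its methods are hypothetical names chosen for 
illustration. It assumes the default values from the description (0.75, 
0.65 and 0.7 for test1, test2 and test3), priors fixed at 1, and 
P(test = T | ~T) = 1 - P(test = T | T).

```java
class BayesMimeSketch {
    // Default trust values from the description:
    // index 0 = magic bytes, 1 = file extension, 2 = metadata Content-Type hint.
    static final double[] TRUST = {0.75, 0.65, 0.70};

    /**
     * Probability that the file has the candidate type, given which tests
     * voted for that type. votedForCandidate[i] is true when test i+1
     * predicted the candidate. Priors are 1, so they drop out.
     */
    static double posterior(boolean[] votedForCandidate) {
        double ifType = 1.0;     // P(observed votes | file is the candidate type)
        double ifNotType = 1.0;  // P(observed votes | file is not the candidate type)
        for (int i = 0; i < TRUST.length; i++) {
            double p = TRUST[i];
            ifType *= votedForCandidate[i] ? p : 1.0 - p;
            ifNotType *= votedForCandidate[i] ? 1.0 - p : p;
        }
        return ifType / (ifType + ifNotType);
    }

    /** Simplified score that ignores negative votes: p / (p + (1 - p)) = p. */
    static double simplified(int testIndex) {
        double p = TRUST[testIndex];
        return p / (p + (1.0 - p));
    }

    public static void main(String[] args) {
        // Example 1: all three tests say pdf.
        System.out.println("ex1 pdf: " + posterior(new boolean[] {true, true, true}));
        // Example 2: test1 = pdf, test2 = txt, test3 = txt.
        System.out.println("ex2 pdf: " + posterior(new boolean[] {true, false, false}));
        System.out.println("ex2 txt: " + posterior(new boolean[] {false, true, true}));
        // Example 3: test1 = pdf, test2 = txt, test3 = uc.
        System.out.println("ex3 txt: " + posterior(new boolean[] {false, true, false}));
        System.out.println("ex3 uc:  " + posterior(new boolean[] {false, false, true}));
        // Simplified form for test1 only.
        System.out.println("simplified test1: " + simplified(0));
    }
}
```

With these defaults the sketch gives roughly 0.92857 for example 1, 
0.40909 and 0.590909 for example 2, and 0.20968 and 0.29518 for the txt 
and uc candidates of example 3, matching the figures in the description.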



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)