[jira] [Commented] (TIKA-1582) Mime Detection based on neural networks with Byte-frequency-histogram

ASF GitHub Bot (JIRA) Sat, 28 Mar 2015 00:30:07 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385177#comment-14385177
 ]


ASF GitHub Bot commented on TIKA-1582:
--------------------------------------

GitHub user LukeLiush opened a pull request:

    https://github.com/apache/tika/pull/36

    Nn branch

    https://issues.apache.org/jira/browse/TIKA-1582


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/LukeLiush/tika nnBranch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #36
    
----
commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92
Author: LukeLiush <[email protected]>
Date:   2015-03-28T07:12:06Z

    https://issues.apache.org/jira/browse/TIKA-1582

commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5
Author: LukeLiush <[email protected]>
Date:   2015-03-28T07:16:07Z

    move the comments of apache licence to the top

commit 701fcc394ed2110e4c771fbb84999dca77932392
Author: LukeLiush <[email protected]>
Date:   2015-03-28T07:19:43Z

    add some comments

commit 12f290826a88cd977779bbf2e1a0385b315e73e3
Author: LukeLiush <[email protected]>
Date:   2015-03-28T07:25:55Z

    move the example model file to the test resource directory

commit 6c8d2e523c427380438f24d90985e28bfdbce050
Author: LukeLiush <[email protected]>
Date:   2015-03-28T07:28:25Z

    remove empty comment block

----


> Mime Detection based on neural networks with Byte-frequency-histogram 
> ----------------------------------------------------------------------
>
>                 Key: TIKA-1582
>                 URL: https://issues.apache.org/jira/browse/TIKA-1582
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, mime
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Priority: Trivial
>
> Content-based detection is one of the popular approaches to MIME type
> detection; others are based on file extensions and magic numbers. Tika
> currently implements three approaches to detecting MIME types:
> 1) file extensions
> 2) magic numbers (the most trustworthy in Tika)
> 3) content-type (the header in the HTTP response, if present and available)
> Content-based MIME type detection, however, analyses the distribution of the
> entire stream of bytes, finds a similar pattern across files of the same
> type, and builds a function that groups them into one or several classes so
> that new files can be classified and predicted. We believe this feature
> might broaden the usage of Tika by adding a bit more security enforcement to
> MIME type detection. Because the model is etched with the patterns it has
> seen, in some situations we may not trust types that the model has not been
> trained on. In some situations, the magic numbers embedded in a file can be
> copied while the actual content is a potentially harmful Trojan program. By
> enforcing trust in byte frequency patterns, we are able to enhance the
> security of the detection.
> The proposed content-based MIME detection to be integrated into Tika is based
> on a machine learning algorithm, namely a neural network trained with
> back-propagation.
> The input: 256 bins (byte values 0-255), each of which stores the count of
> occurrences of the corresponding byte value. The byte frequency histograms
> are normalized to fall in the range between 0 and 1 and are then passed
> through a companding function to enhance the infrequent bytes.
> The output of the neural network is a binary decision, 1 or 0.
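
A minimal Python sketch of the input pipeline described above (count, normalize, compand). The description does not say how the histogram is normalized or which companding function is used, so dividing by the largest bin and the power law x ** (1 / beta) with beta = 2 are assumptions made for illustration only. All function names are hypothetical and are not part of Tika.

```python
def byte_histogram(data):
    """Count how often each byte value (0-255) occurs in data."""
    counts = [0] * 256
    for b in data:  # iterating over bytes yields ints in 0..255
        counts[b] += 1
    return counts


def normalize(counts):
    """Scale counts into [0, 1] by dividing by the largest bin.

    Assumption: the issue text only says "normalized to fall in the
    range between 0 and 1"; dividing by the peak is one way to do that.
    """
    peak = max(counts)
    if peak == 0:  # empty input: avoid division by zero
        return [0.0] * len(counts)
    return [c / peak for c in counts]


def compand(values, beta=2.0):
    """Boost small values with x ** (1 / beta).

    Assumption: the actual companding function is not named in the
    issue; a power law is used here purely as an example.
    """
    return [v ** (1.0 / beta) for v in values]


def features(data, beta=2.0):
    """Full pipeline: histogram, normalization, companding."""
    return compand(normalize(byte_histogram(data)), beta)
```

With this choice, a byte that occurs one ninth as often as the most frequent byte ends up with a feature value of one third rather than one ninth, which is the "enhance the infrequent bytes" effect.
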
> Note that the proposed feature will be implemented with the GRB file type as
> one example.
> In this example, we build a model that is able to distinguish the GRB file
> type from non-GRB file types. Note that the set of non-GRB files is huge and
> cannot be easily defined, so there need to be as many negative training
> examples as possible to form the decision boundary for non-GRB types.
> The neural network is used in two stages: training and classification.
> Training can be done in any programming language. In this feature/research,
> the training of the neural network is implemented in R, and the source can
> be found in my GitHub repository, i.e.
> https://github.com/LukeLiush/filetypeDetection; I am also going to post a
> document that describes the use of the program and the syntax/format of the
> input and output.
> After training, we need to export the model and import it into Tika. In
> Tika, we create a TrainedModelDetector that reads this model file with one
> or more model parameters, or several model files, so that it can detect MIME
> types with the models of those MIME types. Details of the research and usage
> of this proposed feature will be posted on my GitHub shortly.
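
A rough Python sketch of the classification stage, to show what "detecting with an imported model" could look like. This is not Tika's TrainedModelDetector. The model file format is not described here, so the parameters are passed in as a plain dict. The single hidden layer, the sigmoid activation, the 0.5 threshold and the names `W1`, `b1`, `W2`, `b2` are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid(W . x + b) for each output unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict(feature_vector, params, threshold=0.5):
    """Return 1 (target type) or 0 (any other type).

    params is a hypothetical in-memory model:
      W1, b1: hidden-layer weights and biases
      W2, b2: output-layer weights and bias (a single output unit)
    """
    hidden = layer(feature_vector, params["W1"], params["b1"])
    output = layer(hidden, params["W2"], params["b2"])[0]
    return 1 if output >= threshold else 0
```

In a real detector the feature vector would have 256 entries (one per byte value) and the parameters would come from the exported model file.
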
> It is worth noting again that in this research we only worked out one model,
> GRB, as an example to demonstrate the use of this content-based MIME
> detection. One of the challenges, again, is that the non-GRB file types
> cannot be clearly defined unless we feed our model with example data for all
> of the existing file types in the world, which seems too utopian and not
> very likely. It is therefore better that the set of classes/types is given
> and defined in advance to narrow the problem domain.
> Another challenge is the size of the training data; even if we know the types
> we want to classify, getting enough training data to form a model can also
> be one of the main factors of success. In our example model, GRB data are
> collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; we found that the GRB
> data from that source all exhibit a similar pattern, so a simple neural
> network structure is able to predict well, and even a linear logistic
> regression does a good job. However, if we pass GRB files collected from
> other sources to the model for prediction, we find that the model predicts
> poorly and unexpectedly. This raises the question of whether we need to
> include all training data or only the data of interest. Including all data
> is very expensive, so it is necessary to introduce some domain knowledge to
> narrow the problem domain. We believe users should know which types they
> want to classify and should be able to get enough training data, although
> getting the training data can be a tedious and expensive process. Again, it
> is better to have that domain knowledge about the set of types present in
> the users' database and to train a model with some examples for every type
> in the database.
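
The description mentions that even a logistic regression separates the GRB data from one source well. As a hedged illustration of that baseline, here is a small Python sketch of logistic regression trained by batch gradient descent. The training code referenced in the issue is written in R and is not reproduced here. The learning rate, epoch count and function names are assumptions, and any data fed to it in this sketch would be toy vectors, not real byte histograms.

```python
import math


def _sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, labels, lr=1.0, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, dim = len(samples), len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            # error = predicted probability minus true label
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def classify(x, w, b, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 1 if p >= threshold else 0
```

A model fitted this way only reflects the examples it was given, which is the point made above: files of the same type from another source may follow a different byte distribution and be misclassified.
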



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
