[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385177#comment-14385177 ]
ASF GitHub Bot commented on TIKA-1582:
--------------------------------------

GitHub user LukeLiush opened a pull request:

    https://github.com/apache/tika/pull/36

    Nn branch

    https://issues.apache.org/jira/browse/TIKA-1582

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/LukeLiush/tika nnBranch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #36

----
commit eb04f13260bfb5e4f4b0bf7fd54ecd085995cb92
Author: LukeLiush <hanson311...@gmail.com>
Date:   2015-03-28T07:12:06Z

    https://issues.apache.org/jira/browse/TIKA-1582

commit acaf27bb666fdef05bdb18d7edcaafe7ccfd9bf5
Author: LukeLiush <hanson311...@gmail.com>
Date:   2015-03-28T07:16:07Z

    move the comments of apache licence to the top

commit 701fcc394ed2110e4c771fbb84999dca77932392
Author: LukeLiush <hanson311...@gmail.com>
Date:   2015-03-28T07:19:43Z

    add some comments

commit 12f290826a88cd977779bbf2e1a0385b315e73e3
Author: LukeLiush <hanson311...@gmail.com>
Date:   2015-03-28T07:25:55Z

    move the example model file to the test resource directory

commit 6c8d2e523c427380438f24d90985e28bfdbce050
Author: LukeLiush <hanson311...@gmail.com>
Date:   2015-03-28T07:28:25Z

    remove empty comment block

----

> Mime Detection based on neural networks with Byte-frequency-histogram
> ----------------------------------------------------------------------
>
>                 Key: TIKA-1582
>                 URL: https://issues.apache.org/jira/browse/TIKA-1582
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, mime
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Priority: Trivial
>
> Content-based mime type detection is one of the popular approaches to
> detecting mime types; others are based on file extensions and magic
> numbers. Tika currently implements 3 approaches to detecting mime types:
> 1) file extensions
> 2) magic numbers (the most trustworthy in Tika)
> 3) content-type (the header in the HTTP response, if present and available)
> Content-based mime type detection, by contrast, analyses the distribution
> of the entire stream of bytes, finds the pattern shared by files of the
> same type, and builds a function that groups them into one or several
> classes so that new files can be classified and predicted. It is believed
> this feature might broaden the usage of Tika by adding a bit more security
> enforcement to mime type detection. Because the model is shaped by the
> patterns it has seen, in some situations we may choose not to trust types
> that the model has not been trained on. In some situations, the magic
> numbers embedded in a file can be copied while the actual content is a
> potentially harmful Trojan program. By also placing trust in byte
> frequency patterns, we are able to enhance the security of the detection.
> The proposed content-based mime detection to be integrated into Tika is
> based on a machine learning algorithm, i.e. a neural network trained with
> back-propagation.
> The input: 256 bins, one per byte value (0-255), each storing the count of
> occurrences of that byte. The byte frequency histograms are normalized to
> fall in the range between 0 and 1 and are then passed through a companding
> function to emphasize the infrequent bytes.
> The output of the neural network is a binary decision, 1 or 0.
> Note that the proposed feature will be implemented with the GRB file type
> as one example.
> In this example, we build a model that is able to distinguish the GRB file
> type from non-GRB file types. Note that the set of non-GRB files is huge
> and cannot be easily defined, so there need to be as many negative
> training examples as possible to form the decision boundary for the
> non-GRB types.
> The neural network approach consists of two stages: training and
> classification.
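
The input representation and binary output described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the pull request: the issue does not say which companding function or network shape is used, so the power-law companding with exponent 1/beta, the choice of beta = 1.5, normalization by the largest bin, the single hidden layer, and all function and parameter names here are assumptions.

```python
import math


def byte_histogram_features(data, beta=1.5):
    """Return 256 features in [0, 1] for a bytes object.

    Assumptions (not stated in the issue): counts are normalized by the
    largest bin, and companding is the power law x ** (1 / beta).
    """
    counts = [0] * 256
    for b in data:  # iterating over bytes yields ints in 0..255
        counts[b] += 1
    peak = max(counts)
    if peak == 0:  # empty input: no byte occurs at all
        return [0.0] * 256
    # For beta > 1 the exponent is below 1, which raises small values
    # more than large ones, so infrequent bytes become more visible.
    return [(c / peak) ** (1.0 / beta) for c in counts]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(features, hidden_weights, hidden_bias,
             output_weights, output_bias, threshold=0.5):
    """Forward pass of a hypothetical one-hidden-layer network.

    Returns 1 (e.g. "is GRB") or 0 (e.g. "is not GRB").
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    score = sigmoid(
        sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    )
    return 1 if score >= threshold else 0
```

With trained parameters, a call such as classify(byte_histogram_features(content), ...) would yield the 1/0 decision; the weights themselves come out of the training stage and are not shown here.
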
> The training can be done in any programming language. In this
> feature/research, the training of the neural network is implemented in R,
> and the source can be found in my GitHub repository, i.e.
> https://github.com/LukeLiush/filetypeDetection. I am also going to post a
> document that describes the use of the program and the syntax/format of
> the input and output.
> After training, we need to export the model and import it into Tika. In
> Tika, we create a TrainedModelDetector that reads this model file, with
> one or more model parameters, or several model files, so that it can
> detect the mime types covered by those models. Details of the research
> and the usage of this proposed feature will be posted on my GitHub
> shortly.
> It is worth noting again that in this research we only worked out one
> model, GRB, as an example to demonstrate the use of this content-based
> mime detection. One of the challenges, again, is that the non-GRB file
> types cannot be clearly defined unless we feed our model example data for
> all of the existing file types in the world, which seems utopian and
> unlikely. It is therefore better that the set of classes/types is given
> and defined in advance, to narrow the problem domain.
> Another challenge is the size of the training data: even if we know the
> types we want to classify, getting enough training data to form a model
> can also be one of the main factors of success. In our example model, GRB
> data were collected from ftp://hydro1.sci.gsfc.nasa.gov/data/, and we
> found that the GRB data from that source all exhibit a similar pattern;
> a simple neural network structure is able to predict well, and even a
> linear logistic regression is able to do a good job. However, if we pass
> GRB files collected from other sources to the model for prediction, we
> find that the model predicts poorly and unexpectedly. This raises the
> question of whether we need to include all training data or only the data
> that are of interest. Including all data is very expensive, so it is
> necessary to introduce some domain knowledge to narrow the problem
> domain. We believe users should know what types they want to classify and
> should be able to get enough training data, although getting the training
> data can be a tedious and expensive process.
> Again, it is better to use that domain knowledge, take the set of types
> present in the users' database, and train a model with some examples for
> every type in the database.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)