[ https://issues.apache.org/jira/browse/TIKA-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526532#comment-14526532 ]
Tim Allison commented on TIKA-1582:
-----------------------------------

Sounds good. I'd be happy to document the current process on the Rackspace VM
if you'd like to run anything there. My plate is a bit full this week and next.

> Mime Detection based on neural networks with Byte-frequency-histogram
> ----------------------------------------------------------------------
>
>                 Key: TIKA-1582
>                 URL: https://issues.apache.org/jira/browse/TIKA-1582
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, mime
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>              Labels: memex
>             Fix For: 1.9
>
>         Attachments: nnmodel.docx, week2-report-histogram comparison.docx,
> week6 report.docx
>
>
> Content-based mime type detection is one of the popular approaches to
> detecting mime types; others are based on file extensions and magic
> numbers. Tika currently implements three approaches to detecting mime
> types:
> 1) file extensions
> 2) magic numbers (the most trustworthy in Tika)
> 3) content-type (the header in the HTTP response, if present and available)
> Content-based mime type detection, by contrast, analyses the distribution
> of the entire stream of bytes, finds the pattern shared by files of the
> same type, and builds a function that groups files into one or several
> classes so that new files can be classified. It is believed this feature
> might broaden the usage of Tika by adding a bit more security enforcement
> to mime type detection. Because we want to build a model that is etched
> with the patterns it has seen, in some situations we may not trust types
> that the model has not been trained on. In some situations, the magic
> numbers embedded in a file can be copied while the actual content is a
> potentially harmful Trojan program. By enforcing trust in byte frequency
> patterns, we are able to enhance the security of the detection.
> The proposed content-based mime detection to be integrated into Tika is
> based on a machine learning algorithm, i.e. a neural network trained with
> back-propagation.
> The input: 256 bins (byte values 0-255), each of which represents one byte
> value and stores the count of occurrences of that byte. The byte frequency
> histograms are normalized to fall in the range between 0 and 1 and are then
> passed to a companding function to enhance the infrequent bytes.
> The output of the neural network is a binary decision, 1 or 0.
> Note that the proposed feature will be implemented with the GRB file type
> as one example.
> In this example, we build a model that is able to distinguish the GRB file
> type from non-GRB file types. Note that the set of non-GRB files is huge
> and cannot be easily defined, so there need to be as many negative training
> examples as possible to form the non-GRB decision boundary.
> The neural network is used in two stages: training and classification.
> The training can be done in any programming language. In this
> feature/research, the training of the neural network is implemented in R,
> and the source can be found in my github repository, i.e.
> https://github.com/LukeLiush/filetypeDetection; I am also going to post a
> document that describes the use of the program and the syntax/format of
> the input and output.
> After training, we need to export the model and import it into Tika. In
> Tika, we create a TrainedModelDetector that reads this model file with one
> or more model parameters, or several model files, so it can detect the mime
> types covered by those models. Details of the research and usage of this
> proposed feature will be posted on my github shortly.
> It is worth noting again that in this research we only worked out one
> model, GRB, as one example to demonstrate the use of this content-based
> mime detection.
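
For readers following along, the input representation described in the issue
(a 256-bin byte histogram, normalized to the range 0 to 1 and then companded)
might be sketched roughly as below. This is only an illustration, not the
code from the issue or from Tika: the description does not say how the
histogram is normalized or which companding function is used, so dividing by
the largest bin and the power-law form x ** (1 / beta) with beta = 1.5 are
assumptions, and all function names are hypothetical.

```python
def byte_histogram(data):
    """Count how often each byte value (0-255) occurs in data."""
    counts = [0] * 256
    for b in data:  # iterating over bytes yields ints in 0..255
        counts[b] += 1
    return counts


def normalize(counts):
    """Scale counts into [0, 1] by dividing by the largest bin.

    Assumption: the issue only says the histogram is normalized to fall
    between 0 and 1, not which normalization is used.
    """
    peak = max(counts)
    if peak == 0:  # empty input: avoid division by zero
        return [0.0] * len(counts)
    return [c / peak for c in counts]


def compand(values, beta=1.5):
    """Boost small values with a power-law companding function.

    Assumption: x ** (1 / beta) is one common choice; the exponent and the
    functional form are hypothetical, not taken from the issue.
    """
    return [v ** (1.0 / beta) for v in values]


def features(data, beta=1.5):
    """Build the 256-element feature vector that would be fed to a classifier."""
    return compand(normalize(byte_histogram(data)), beta)


if __name__ == "__main__":
    vec = features(b"aaab")
    print(len(vec), vec[ord("a")], round(vec[ord("b")], 3))
```

For any beta greater than 1, values strictly between 0 and 1 are pushed
upward, which is what lets infrequent bytes contribute more to the feature
vector than their raw frequency would.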
> One of the challenges, again, is that the non-GRB file types cannot be
> clearly defined unless we feed our model with example data for all of the
> existing file types in the world, which seems too utopian and rather
> unlikely, so it is better that the set of classes/types is given and
> defined in advance to minimize the problem domain.
> Another challenge is the size of the training data; even if we know the
> types we want to classify, getting enough training data to form a model can
> also be one of the main factors of success. In our example model, GRB data
> are collected from ftp://hydro1.sci.gsfc.nasa.gov/data/; we found that the
> GRB data from that source all exhibit a similar pattern, so a simple neural
> network structure is able to predict well, and even a linear logistic
> regression is able to do a good job. However, if we pass GRB files
> collected from other sources to the model for prediction, the model
> predicts poorly and unexpectedly. This brings up the question of whether we
> need to include all training data or only the data of interest. Including
> all data is very expensive, so it is necessary to introduce some domain
> knowledge to minimize the problem domain; we believe users should know what
> types they want to classify and should be able to get enough training data,
> although getting the training data can be a tedious and expensive process.
> Again, it is better to have that domain knowledge about the set of types
> present in the users' database and to train a model with some examples of
> every type in the database.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)