> Here I can understand the data.txt(training data set, mentioned above ) but
> can't understand what the file *en-doccat.bin *is and what is the content
> of the file.
>
en-doccat.bin is a model. A model is the outcome of training a classifier on
a dataset. One takes a dataset (data.txt in the linked example), runs it
through a training routine (DocumentCategorizerME.train in the linked
example) and gets a model file (en-doccat.bin in the linked example) as a
result. One then uses the model file with a classifier to classify new
sentences.
The format of the training data is as follows (see
DocumentSampleStream.java):
* Each line contains one sample document.
* The category is the first string in the line, followed by a tab and
  whitespace-separated document tokens.
* Sample line: category-string tab-char whitespace-separated-tokens
  line-break-char(s)
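
To make the line format concrete, here is a minimal sketch in plain Java that splits one training line into its category and tokens. It is not the code from DocumentSampleStream.java; it only follows the description above. The class name SampleLineDemo, the Sample holder, and the "sports" example line are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SampleLineDemo {

    /** Hypothetical holder for one parsed training line (not an OpenNLP class). */
    static final class Sample {
        final String category;
        final List<String> tokens;

        Sample(String category, List<String> tokens) {
            this.category = category;
            this.tokens = tokens;
        }
    }

    /** Splits "category<TAB>token token ..." into a category and its tokens. */
    static Sample parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected a tab after the category: " + line);
        }
        String category = line.substring(0, tab);
        // Everything after the first tab is the document text.
        String text = line.substring(tab + 1).trim();
        List<String> tokens = text.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(text.split("\\s+"));
        return new Sample(category, tokens);
    }

    public static void main(String[] args) {
        // Illustrative line; real lines come from a file such as data.txt.
        Sample s = parse("sports\tthe team won the final match");
        System.out.println(s.category + " -> " + s.tokens);
        // prints: sports -> [the, team, won, the, final, match]
    }
}
```

Each line of data.txt would be one such record: a category label, a tab, then the text of the document.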
In the name of the model ("en-doccat.bin"), the prefix ("en") usually
indicates the language (English in this case). The suffix ("doccat") usually
indicates the model type (document categorizer, which is a synonym for
classifier). The extension (".bin") usually indicates that this is a binary
file.
Aliaksandr