Hello!
My working training code is attached at the bottom. My question concerns
what happens when I apply the model like this:
TokenNameFinderModel model;
try (InputStream modelIn = new FileInputStream("de-clinics-drugs.bin")) {
    model = new TokenNameFinderModel(modelIn);
}
NameFinderME drugFinder = new NameFinderME(model);
// apply model to NER task
for (String[][] document : documents) {
    for (String[] sentence : document) {
        Span[] nameSpans = drugFinder.find(sentence);
        // do something with the drug names
    }
    // reset document-level adaptive features before the next document
    drugFinder.clearAdaptiveData();
}
From the documentation:
"The descriptor file is stored inside the model after training and the
feature generators are configured correctly when the name finder is
instantiated."
Are the dictionary and the POSModel that I used for training also stored in
the trained model "de-clinics-drugs.bin"?
If not, how can I pass these resources to the NameFinderME?
I can use the model and it works, but I am not sure whether these two
features ("tokenpos" and "dictionary" in the feature XML) are properly
loaded and applied to the NER task.
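One check I can think of, assuming the .bin file is an ordinary zip archive
(as far as I understand, OpenNLP model packages are), is to list its entries
and see whether anything matching the resource keys "de-POS" and "drugNames"
shows up next to the descriptor. This is only a sketch: the class name
ModelEntryLister is made up for illustration, and I do not know what the
entries would be called if they are present.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ModelEntryLister {

    // Returns the names of all entries in the given zip archive,
    // in the order they appear in the archive's directory.
    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Default is the model file name from this post; pass another path as args[0].
        String path = args.length > 0 ? args[0] : "de-clinics-drugs.bin";
        if (!new File(path).isFile()) {
            System.err.println("No such file: " + path);
            return;
        }
        for (String name : listEntries(path)) {
            System.out.println(name);
        }
    }
}
```

This only tells me what is packaged in the file. It does not show whether
the name finder actually uses those entries when it is instantiated.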
Thank you!
Best regards, Markus
---Features---
xml-content:
<generators>
  <cache>
    <generators>
      <window prevLength="2" nextLength="2">
        <tokenclass/>
      </window>
      <window prevLength="2" nextLength="2">
        <token/>
      </window>
      <window prevLength="2" nextLength="2">
        <tokenpattern/>
      </window>
      <window prevLength="2" nextLength="2">
        <charngram min="2" max="5"/>
      </window>
      <window prevLength="2" nextLength="2">
        <prefix length="5"/>
      </window>
      <window prevLength="2" nextLength="2">
        <suffix length="5"/>
      </window>
      <window prevLength="2" nextLength="2">
        <tokenpos model="de-POS"/>
      </window>
      <window prevLength="2" nextLength="2">
        <definition/>
      </window>
      <window prevLength="2" nextLength="2">
        <prevmap/>
      </window>
      <window prevLength="2" nextLength="2">
        <bigram/>
      </window>
      <sentence begin="true" end="true"/>
      <dictionary dict="drugNames"/>
      <postagger/>
    </generators>
  </cache>
</generators>
---Training---
Training:
InputStreamFactory in = new MarkableFileInputStreamFactory(new File(pathToTrainingFiles));
ObjectStream<String> lineStream = new PlainTextByLineStream(in, "UTF-8");
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
TrainingParameters mlParams = new TrainingParameters();
mlParams.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(1000));
mlParams.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(1));
// loading custom feature configuration from file
byte[] featuresFromXML = Files.readAllBytes(new File(pathToFeatureDefinition).toPath());
// loading part-of-speech model (trained on the German TIGER corpus)
POSModel posModel;
try (InputStream modelIn = new FileInputStream(pathToPOSModel)) {
    posModel = new POSModel(modelIn);
}
// loading dictionary resource
Dictionary drugDictionary = new Dictionary();
List<String> drugNames = FileUtils.readLines(new File(pathToDictionary), "UTF-8");
for (String drugName : drugNames) {
    drugDictionary.put(new StringList(drugName));
}
// filling resources object
Map<String, Object> resources = new HashMap<String, Object>();
resources.put("de-POS", posModel);
resources.put("drugNames", drugDictionary);
TokenNameFinderModel model;
TokenNameFinderFactory imiFactory = new TokenNameFinderFactory(featuresFromXML, resources, new BioCodec());
try {
    model = NameFinderME.train("de", "drug", sampleStream, mlParams, imiFactory);
} finally {
    sampleStream.close();
}
// write the trained model to disk; the stream is closed automatically
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("de-clinics-drugs.bin"))) {
    model.serialize(modelOut);
}