Hello,
I am trying to train a model using data annotated with the brat annotation
tool. I have the annotation config and all the .txt and .ann files. I have
successfully trained a model using:
public TokenNameFinderModel trainModel(File corpusDir) throws IOException {
    //
    // Set up the directory structure of the corpus...
    //
    File trainingDir = new File(corpusDir, "train");
    File testDir = new File(corpusDir, "test");
    File config = new File(corpusDir, "annotation.conf");
    //
    // Create a NameSample stream...
    //
    String[] args = {
        "-bratDataDir", trainingDir.getAbsolutePath(),
        "-annotationConfig", config.getAbsolutePath(),
        "-ruleBasedTokenizer", "simple"
    };
    ObjectStreamFactory<NameSample> bratFactory =
        StreamFactoryRegistry.getFactory(NameSample.class, "brat");
    ObjectStream<NameSample> trainingStream = bratFactory.create(args);
    //
    // Train the model...
    //
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, "70");
    params.put(TrainingParameters.CUTOFF_PARAM, "1");
    TokenNameFinderModel nameFinderModel = NameFinderME.train("en", null, trainingStream,
        params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
    NameFinderME nameFinder = new NameFinderME(nameFinderModel);
    //
    // Evaluate the model...
    //
    trainingStream.reset();
    TokenNameFinderEvaluator evaluator =
        new TokenNameFinderEvaluator(nameFinder, new NameEvaluationErrorListener());
    evaluator.evaluate(trainingStream);
    System.out.println("on training data\n" + evaluator.getFMeasure());
    // Return the model...
    return nameFinderModel;
}
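For reference, the corpus directory I point trainModel at is laid out like
this (the annotation.conf, train and test names come from the code above;
the document file names are just placeholders, but the .txt/.ann pairing is
standard brat standoff output):

```text
corpusDir/
    annotation.conf      <- the brat annotation configuration
    train/
        doc-001.txt      <- document text
        doc-001.ann      <- brat standoff annotations for doc-001.txt
        ...
    test/
        doc-101.txt
        doc-101.ann
        ...
```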
But when I try to use the model…
public void eval(File dir, NameFinderME nameFinder) throws IOException {
    // Load the data...
    FileFilter txtFileFilter = (File x) -> x.getName().endsWith("txt") && x.length() > 0;
    File[] files = dir.listFiles(txtFileFilter);
    for (File responseFile : files) {
        // Read the file into a string...
        String response = readResponse(responseFile);
        // Break the string into sentences...
        Span[] sentenceSpans = sentenceDetector.sentPosDetect(response);
        for (Span sentenceSpan : sentenceSpans) {
            String sentence = sentenceSpan.getCoveredText(response).toString();
            // Break the sentence into tokens...
            String[] tokens = tokenizer.tokenize(sentence);
            // Find the "names"...
            Span[] spans = nameFinder.find(tokens);
            int spanId = 0;
            if (spans.length > 0) {
                System.out.println(responseFile.getName());
                for (Span span : spans) {
                    Span offsetSpan = new Span(span, sentenceSpan.getStart());
                    // Print out the "names" found...
                    System.out.println("\tT" + (++spanId) + "\t" + offsetSpan.getStart()
                        + " " + offsetSpan.getEnd() + "\t" + offsetSpan.getCoveredText(response));
                }
            }
        }
    }
}
private String readResponse(File file) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    return sb.toString();
}
I get spans that don’t line up with word boundaries. So it is worse than
wrong… it’s nonsense. Clearly I am doing something wrong. Any ideas?
Daniel