That looks good until the end. There you should use the tokens array to match the Spans returned by the NameFinderME.find method, not the sentence String. The returned indices are based on the input tokens array.
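In other words, a name Span holds token indices, while each token has its own character Span inside the sentence, so the two must be resolved through each other. A minimal self-contained sketch of that mapping (the nested Span record is only a stand-in for opennlp.tools.util.Span, and the token offsets mimic what a Tokenizer.tokenizePos call would return):

```java
// Stand-in sketch, not the OpenNLP API itself: shows why a Span from
// NameFinderME.find must be resolved through the token array, not the
// sentence String.
public class SpanMapping {
    // minimal stand-in for opennlp.tools.util.Span
    record Span(int start, int end) {}

    // name.start()/name.end() are token indices; translate them to
    // character offsets via the token spans before slicing the sentence
    static String covered(String sentence, Span[] tokenSpans, Span name) {
        int charStart = tokenSpans[name.start()].start();
        int charEnd = tokenSpans[name.end() - 1].end();
        return sentence.substring(charStart, charEnd);
    }

    public static void main(String[] args) {
        String sentence = "Dan Russ uses OpenNLP";
        // character-offset spans, as tokenizer.tokenizePos(sentence) would give
        Span[] tokenSpans = { new Span(0, 3), new Span(4, 8),
                              new Span(9, 13), new Span(14, 21) };
        // a name covering tokens [0, 2), as nameFinder.find(tokens) would give
        Span person = new Span(0, 2);
        System.out.println(covered(sentence, tokenSpans, person)); // prints "Dan Russ"
    }
}
```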
We have a method which can convert the spans back into Strings:
Span.spansToStrings(Span[] spans, java.lang.String[] tokens)

HTH,
Jörn

On Mon, May 22, 2017 at 8:20 PM, Dan Russ <[email protected]> wrote:
> Hello,
> I am trying to train a model using data annotated with the brat
> annotator. I have the annotation config and all the .txt and .ann
> files. I have successfully trained a model using:
>
>     public TokenNameFinderModel trainModel(File corpusDir) throws IOException {
>         //
>         // set up the directory structure of the corpus...
>         //
>         File trainingDir = new File(corpusDir, "train");
>         File testDir = new File(corpusDir, "test");
>         File config = new File(corpusDir, "annotation.conf");
>
>         //
>         // Create a NameSample Stream...
>         //
>         String[] args = {"-bratDataDir", trainingDir.getAbsolutePath(),
>                 "-annotationConfig", config.getAbsolutePath(),
>                 "-ruleBasedTokenizer", "simple"};
>         ObjectStreamFactory<NameSample> basFactory =
>                 StreamFactoryRegistry.getFactory(NameSample.class, "brat");
>         ObjectStream<NameSample> trainingStream = basFactory.create(args);
>
>         //
>         // Train the model...
>         //
>         TrainingParameters params = new TrainingParameters();
>         params.put(TrainingParameters.ITERATIONS_PARAM, "70");
>         params.put(TrainingParameters.CUTOFF_PARAM, "1");
>         TokenNameFinderModel nameFinderModel =
>                 NameFinderME.train("en", null, trainingStream, params,
>                         TokenNameFinderFactory.create(null, null,
>                                 Collections.emptyMap(), new BioCodec()));
>
>         NameFinderME nameFinder = new NameFinderME(nameFinderModel);
>         //
>         // Eval the model...
>         //
>         trainingStream.reset();
>         TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(
>                 nameFinder, new NameEvaluationErrorListener());
>         evaluator = new TokenNameFinderEvaluator(nameFinder);
>         evaluator.evaluate(trainingStream);
>         System.out.println("on training data\n" + evaluator.getFMeasure());
>         // return the model...
>         return nameFinderModel;
>     }
>
> But, when I try to use the model...
>
>     public void eval(File dir, NameFinderME nameFinder) throws IOException {
>         // load the data...
>         FileFilter txtFileFilter = (File x) -> {
>             return x.getName().endsWith("txt") && x.length() > 0;
>         };
>         File[] files = dir.listFiles(txtFileFilter);
>
>         for (File responseFile : files) {
>             // read the file into a string...
>             String response = readResponse(responseFile);
>             // break the string into sentences...
>             Span[] sentenceSpans = sentenceDetector.sentPosDetect(response);
>             for (Span sentenceSpan : sentenceSpans) {
>                 String sentence = sentenceSpan.getCoveredText(response).toString();
>                 // break the sentences into tokens...
>                 String[] tokens = tokenizer.tokenize(sentence);
>                 // find the "names"...
>                 Span[] spans = nameFinder.find(tokens);
>                 int spanId = 0;
>                 if (spans.length > 0) {
>                     System.out.println(responseFile.getName());
>                     for (Span span : spans) {
>                         Span offsetSpan = new Span(span, sentenceSpan.getStart());
>                         // print out the "names" found...
>                         System.out.println("\tT" + (++spanId) + "\t"
>                                 + offsetSpan.getStart() + " " + offsetSpan.getEnd()
>                                 + "\t" + offsetSpan.getCoveredText(response));
>                     }
>                 }
>             }
>         }
>     }
>
>     private String readResponse(File file) throws IOException {
>         StringBuilder sb = new StringBuilder();
>         try (BufferedReader in = new BufferedReader(new FileReader(file))) {
>             sb.append(in.readLine());
>         }
>         return sb.toString();
>     }
>
> I get spans that don't line up with word boundaries. So it is worse than
> wrong... it's nonsense. Clearly I am doing something wrong. Any idea?
> Daniel
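For completeness: the misaligned offsets in the eval method above come from new Span(span, sentenceSpan.getStart()). The span still holds token indices, so shifting it by the sentence's character offset produces nonsense. Tokenizing with tokenizePos instead of tokenize keeps the character spans around, and the document-level offset is then just two additions. A self-contained sketch of that arithmetic (the Span record again stands in for opennlp.tools.util.Span; all names are illustrative):

```java
// Stand-in sketch of the offset arithmetic, not the OpenNLP API:
// token index -> sentence character offset (via token spans from
// tokenizePos) -> document character offset (via the sentence start).
public class DocumentOffsets {
    record Span(int start, int end) {}

    static Span toDocumentSpan(Span name, Span[] tokenSpans, int sentenceStart) {
        // name holds token indices; tokenSpans hold sentence-relative offsets
        int charStart = tokenSpans[name.start()].start() + sentenceStart;
        int charEnd = tokenSpans[name.end() - 1].end() + sentenceStart;
        return new Span(charStart, charEnd);
    }

    public static void main(String[] args) {
        String document = "Intro text. Dan Russ uses OpenNLP.";
        int sentenceStart = 12; // as sentenceSpan.getStart() would give
        // sentence-relative token spans, as tokenizer.tokenizePos would give
        Span[] tokenSpans = { new Span(0, 3), new Span(4, 8), new Span(9, 13),
                              new Span(14, 21), new Span(21, 22) };
        Span name = new Span(0, 2); // token indices from the name finder
        Span doc = toDocumentSpan(name, tokenSpans, sentenceStart);
        System.out.println(document.substring(doc.start(), doc.end())); // prints "Dan Russ"
    }
}
```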
